class: center, middle, inverse, title-slide # Poisson and Dicrete Uniform Distributions ## + Lots of review ### Dr. Dogucu ### 2019-10-21 --- layout: true <div class="my-header"></div> <div class="my-footer"> Copyright © <a href="https://mdogucu.ics.uci.edu">Dr. Mine Dogucu</a>. All Rights Reserved.</div> --- ## Announcements - Note that you already have a homework due on 10/25. You will not be assigned a homework on 10/25 that would normally be due 11/01. Instead you will be provided midterm review questions. These questions will not be graded. You are encouraged to study in groups. --- ## FAQ - Line Break in Rmd This is the first sentence. This is the second sentence. This is the first sentence. This is the second sentence. <img src = "img/line-break.png" class="center-image"> </img> Even though these two paragraphs look similar in their .Rmd backend, there is a slight difference. The second paragraph has a line break. At the end of the first sentence, you can hit the space bar twice before hitting enter to make a line break. --- ## Review - Sampling - A good sample is __representative__ of the population. - In simple random sampling every element of the sample has an equal chance of being selected. - There are other sampling methods that help us get _representative_ samples. --- ## Review - Sample Statistics For some samples we have looked at some statistics. For instance we have looked at mean differences of domestic gross of movies that passed the Bechdel test and those that did not. We looked at independence of gun ownership and being born in this country based on General Social Survey. Note that so far we only talked about samples. We will learn how to make __inference__ about the population after the midterm. --- ## Review - Data Visualization ```r glimpse(cats) ``` ``` ## Observations: 144 ## Variables: 3 ## $ Sex <fct> F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F... ## $ Bwt <dbl> 2.0, 2.0, 2.0, 2.1, 2.1, 2.1, 2.1, 2.1, 2.1, 2.1, 2.1, 2.1... ## $ Hwt <dbl> 7.0, 7.4, 9.5, 7.2, 7.3, 7.6, 8.1, 8.2, 8.3, 8.5, 8.7, 9.8... ``` The `cats` data frame has two variables of interest `Bwt` and `Hwt` which are Birth weight and Heart Weight of cats respectively. If we wanted to examine the relationship between `Bwt` and `Hwt` what `geom_object()` could we utilize? In other words, what kind of plot could we make? --- ## Review - Data Visualization ```r cats %>% ggplot(aes(x = Bwt, y = Hwt)) + geom_point() + theme(text = element_text(size=30)) ``` <img src="slide-4a-pois-uni_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> --- ## Review - Data Visualization <img src="slide-4a-pois-uni_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> This is a histogram of class of passengers on Titanic. True or False? --- ## Review - Data Visualization <img src="slide-4a-pois-uni_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> False! It is a bar plot. Categorical variables are best represented by bar plots. Even though class labels may look 1, 2, 3 numeric they are categories. --- ## Review - Data Visualization For single categorical variable: bar plot For two categorical variables: stacked bar plots For single numeric variable: histogram, dot plot, box plot For two numeric variables: scatter plot For a categorical variable and a numeric variable: side-by-side box plot There are many other ways to visualize data. Your `ggplot` cheat sheets have many other `geom_objects()` --- ## Review - pmf and cdf __pmf__: `\(f_X(x) = P(X = x)\)` __cdf__: `\(F_X(x) = P(X \leq x)\)` --- ## Review - pmf and cdf Let `\(X\)` be a random variable with a probability mass function `\(f(x)\)` and cumulative distribution function `\(F(x)\)`. Fill in the blanks in the table. | x | 1 | 2 | 3 | 4 | |-------- |------ |------ |------ |--- | | `\(f(x)\)` | 0.23 | ? | 0.15 | ? | | `\(F(x)\)` | ? | 0.36 | ? | ? | --- ## Review - pmf and cdf | x | 1 | 2 | 3 | 4 | |-------- |------ |------ |------ |------ | | `\(f(x)\)` | 0.23 | 0.13 | 0.15 | 0.49 | | `\(F(x)\)` | 0.23 | 0.36 | 0.51 | 1 | --- ## Numeric Variables in R ```r age <- 21 typeof(age) ``` ``` ## [1] "double" ``` ```r str(age) ``` ``` ## num 21 ``` --- ## Numeric Variables in R ```r temp <- 72.25 typeof(temp) ``` ``` ## [1] "double" ``` ```r str(temp) ``` ``` ## num 72.2 ``` --- ## Numeric Variables in R ```r age2 <- 21L typeof(age2) ``` ``` ## [1] "integer" ``` ```r str(age2) ``` ``` ## int 21 ``` --- ## Categorical Variables in R ```r outcome <- c("heads", "tails", "tails", "tails", "heads") str(outcome) ``` ``` ## chr [1:5] "heads" "tails" "tails" "tails" "heads" ``` ```r outcome <- factor(outcome) str(outcome) ``` ``` ## Factor w/ 2 levels "heads","tails": 1 2 2 2 1 ``` Characters signify that the values are nonnumeric. Factor signifies that it is a categorical vector and also shows that there are two categories(levels). --- ## Categorical Variables in R ```r response <- c("Extremely Disagree", "Disagree", "Agree", "Extremely Agree") response <- factor(response, order = TRUE) str(response) ``` ``` ## Ord.factor w/ 4 levels "Agree"<"Disagree"<..: 4 2 1 3 ``` --- ## Categorical Variables in R Let's fix the ordering ```r response <- factor(response, order = TRUE, levels = c("Extremely Disagree", "Disagree", "Agree", "Extremely Agree")) str(response) ``` ``` ## Ord.factor w/ 4 levels "Extremely Disagree"<..: 1 2 3 4 ``` --- ## FAQ - When do we use factor()? Data does not always come in perfect shape for us ready to use for our analysis. Consider the variables in the `titanic` dataset. Can you spot some that have wrong types? ```r glimpse(titanic_train) ``` ``` ## Observations: 891 ## Variables: 12 ## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,... ## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,... ## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,... ## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra... ## $ Sex <chr> "male", "female", "female", "female", "male", "mal... ## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ... ## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,... ## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,... ## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138... ## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ... ## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", ... ## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", ... ``` --- ## FAQ - When do we use factor()? ```r titanic_train %>% ggplot(aes(x = Pclass)) + geom_bar() ``` <img src="slide-4a-pois-uni_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> Even though `Pclass` is defined as an integer, we could still use `geom_bar()`. --- ## FAQ - When do we use factor()? ```r titanic_train %>% ggplot(aes(x = Pclass, y = Fare)) + geom_boxplot() ``` ``` ## Warning: Continuous x aesthetic -- did you forget aes(group=...)? ``` ![](slide-4a-pois-uni_files/figure-html/unnamed-chunk-17-1.png)<!-- --> What is wrong with this plot? --- ## FAQ - When do we use factor()? ```r titanic_train %>% ggplot(aes(x = factor(Pclass), y = Fare)) + geom_boxplot() ``` ![](slide-4a-pois-uni_files/figure-html/unnamed-chunk-19-1.png)<!-- --> --- ## A different solution ```r titanic_train %>% mutate(Pclass = as.factor(Pclass)) %>% ggplot(aes(x = Pclass, y = Fare)) + geom_boxplot() ``` ![](slide-4a-pois-uni_files/figure-html/unnamed-chunk-21-1.png)<!-- --> --- ## `mutate()` We use mutate to create new variables or over-write the old ones. Why do you think the following code still shows `Pclass` as an integer even though we mutated it to be a factor? ```r glimpse(titanic_train) ``` ``` ## Observations: 891 ## Variables: 12 ## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,... ## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,... ## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,... ## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra... ## $ Sex <chr> "male", "female", "female", "female", "male", "mal... ## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ... ## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,... ## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,... ## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138... ## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ... ## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", ... ## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", ... ``` --- ## `mutate()` ```r titanic_train <- titanic_train %>% mutate(Pclass = as.factor(Pclass)) glimpse(titanic_train) ``` ``` ## Observations: 891 ## Variables: 12 ## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,... ## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,... ## $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,... ## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra... ## $ Sex <chr> "male", "female", "female", "female", "male", "mal... ## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ... ## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,... ## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,... ## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138... ## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ... ## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", ... ## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", ... ``` --- ## Review - Binomial vs. Geometric __Binomial__ The probability of `\(k\)` successes in `\(n\)` trials. `\(\Omega_k = \{0, 1, 2, 3, ... n\}\)` __Geometric__ The probability that the success will be in the `\(n\)`th trial for the first time. `\(\Omega_n = \{1, 2, 3, 4, ...\}\)` --- ## Review - Binomial vs. Geometric | | Binomial | Geometric | |----- |------------ |----------- | | pmf | `dbinom()` | `dgeom()` | | cdf | `pbinom()` | `pgeom()` | --- ## Review It is estimated that 45.3% of emails we receive are spam [(Statista)](https://www.statista.com/statistics/420400/spam-email-traffic-share-annual/). Mohsen has received 30 emails today. 1) How many of these emails would you expect to be spam? Call this value `\(\mu\)`. 2) If Mohsen was to receive 30 emails every day would you expect `\(\mu\)` many spam email every time? Is there variance? Calculate. 3) Pause and think. Of the 30 emails he received, what is the probability that 29 are spam? 4) What is the probability that the number of spam emails he received is less than 9? --- ## Review 1) How many of these emails would you expect to be spam? Call this value `\(\mu\)`. `\(E(X) = n p = 30 \times 0.453 = 13.59\)` ```r n <- 30 p <- 0.453 mu <- n*p mu ``` ``` ## [1] 13.59 ``` --- ## Review 2) If Mohsen was to receive 30 emails every day would you expect `\(\mu\)` many spam email every time? Is there variance? Calculate. No. The expected value represents an average number of spam emails (successes) if we were to repeat the random process (receiving 30 emails) many times. `\(Var(X) = n p (1-p) = 30 \times 0.453 \times (1-0.453) = 7.43373\)` ```r sigma2 <- n*p*(1-p) sigma2 ``` ``` ## [1] 7.43373 ``` --- ## Review 3) Pause and think. Of the 30 emails he received, what is the probability that 29 are spam? `\(P(k = 29) = {30 \choose 29}(0.453)^{29} (1-0.453)^1\)` ```r dbinom(x = 29, size = 30, prob = 0.453) ``` ``` ## [1] 1.745647e-09 ``` ```r dbinom(x = 29, size = n, prob = p) ``` ``` ## [1] 1.745647e-09 ``` --- ## Review 4) What is the probability that the number of spam emails he received is less than 9? `\(P(k<9) = P(k\leq8)\)` so this is a cdf question. `\(= P(k =0) + P(k=1) +.... P(k=7) +P(k=8)\)` ```r pbinom(q = 8, size = 30, prob = 0.453) ``` ``` ## [1] 0.02898265 ``` ```r pbinom(q = 8, size = n, prob = p) ``` ``` ## [1] 0.02898265 ``` --- ## Discrete Uniform Distribution In discrete uniform distribution, all values are equally likely to be observed. A fair roll die follows a uniform distribution. `\(X\sim unif(a,b)\)` pmf: `\(P(X=x) = \frac{1}{n}\)` `\(\Omega_X = \{a, a+1, ..., b-1, b\}\)` --- ## Discrete Uniform Distribution A broken phone alarm starts ringing once every day randomly. Let X be the random variable representing the hour on the 24 hour system when the alarm goes on. Then `\(\Omega_X = \{0, 1, 2, ...,22,23\}\)` `\(X\sim unif(0,23)\)` The probability that the alarm goes sometime between 15:00 - 16:00. `\(P(X = 15) = \frac{1}{24} = 0.04167\)` --- ## Discrete Uniform Distribution Note that the R function `dunif()` is used to calculate probability density function of continuous uniform distributions and __not__ for probability mass function of discrete uniform distributions. --- ## Discrete Uniform Distribution R is an open-source language. That means any body can contribute to R. Luckily someone wanted to contribute with a discrete uniform probability mass function. However, this function is not installed when R is installed. --- ## R packages When we download R or open RStdudio cloud, we have R functions that we can use such as `dbinom()`. These already installed set of functions are part of __base R__. The rest comes from the open source community who develop functions, datasets and provide it for the public use in R packages. For instance the `ddunif()` function comes from the `fitur` package. --- ## Using R packages In order to use a package, you first have to make sure it is downloaded. ```r install.packages("fitur") ``` You only do installation process once. Once it is downloaded it will always be downloaded to your computer (RStudio Cloud is slightly different). --- ## Using R packages Once a package is downloaded you can use any function or object provided within the package. In order to use the `ddunif()` function from `fitur` package we have two options. Option 1: ```r library(fitur) ddunif(x = 15, min = 0, max = 23) ``` ``` ## [1] 0.04166667 ``` Use this option if you are going to keep using functions from the same package over and over again. `library(fitur)` and all packages you load should be at the top of your code. --- ## Using R packages Option 2: ```r fitur::ddunif(x = 15, min = 0, max = 23) ``` ``` ## [1] 0.04166667 ``` Use this option if you are going to use functions from this package only few times. Also, this is a good option if you are unfamiliar with a package and you want to remind yourself which package the function belongs to. --- ## Poisson Distribution `\(X \sim Poisson (\lambda)\)` pmf: `\(P(k) = \frac{\lambda^k}{k!} e^{-\lambda}\)` where `\(k\)` is number of observed events. Note: The number of events in a __fixed__ time or space. `\(\Omega_x = \{0,1,2....\}\)` `\(E(X) = Var(X) = \lambda\)` --- ## Poisson Distribution At an emergency clinic, 12 patients arrive on average per hour. What is the probability that 8 patients will arrive in the next hour? `\(P(k = 8) = \frac{12^8}{8!}e^{-12}=\)` ```r 12^8/factorial(8)*exp(-12) ``` ``` ## [1] 0.06552328 ``` ```r dpois(x = 8, lambda = 12) ``` ``` ## [1] 0.06552328 ``` --- ## Poisson Distribution What is the probability that 20 patients will arrive in the next hour? `\(P(X = 20)\)` ```r dpois(x = 20, lambda = 12) ``` ``` ## [1] 0.009682032 ``` --- ## Poisson Distribution What is the probability that less than 20 patients will arrive in the next hour? `\(P(X < 20) = P(X \leq 19)\)` ```r ppois(q = 19, lambda =12) ``` ``` ## [1] 0.9787202 ```