Sampling and Study Design

Dr. Mine Dogucu

Sampling

Research question

Every research project aims to answer a research question (or multiple questions).

Example

Do UCI students who exercise regularly have higher GPA?

We will use this research question throughout the examples in the lecture.

Population

Each research question aims to examine a population.

Example

Population for this research question is UCI students.

Data Collection

Data are collected to answer research questions. There are different methods to collect data. For instance, data can be collected

in-person or online (if collecting from human subjects)
on-site or off-site (e.g. rainfall measures vs. moon image tracking)
with different tools such as surveys, motion sensors (e.g. marathon finish lines)

Data Collection - Ethics

When collecting data from human and animal research subjects we need to consider ethics.

In universities, the rights of human and animal research subjects are protected by each university’s Institutional Review Board (IRB). If interested (highly recommended), you can read about UCI’s Institutional Review Board.

Example

Consider that we design a survey with the following questions to study the research question.

Do you exercise at least once every week?
What is your GPA?

Sampling

A population is a collection of elements that the research question aims to study. However, it is often costly and sometimes impossible to study the whole population, so a subset of the population is often selected to be studied. A sample is the subset of the population that is studied. The goal is to have a sample that is representative of the population so that the findings of the study can generalize to the population.

Example

Since it would be almost impossible to give the survey to ALL UCI students, we can give it to a sample of students.

There are different sampling methods to consider.

Convenience (Availability) Sampling

Convenience sampling occurs when a specific sample is selected because the sample is easy to access.

Example

Stand in front of Langson Library
Give the survey to 100 UCI students

This could introduce (sampling) bias and the findings may not generalize to the population. It is possible that those in front of the library

may study more and thus may have higher GPA.
may be more active than those who study at home/dorm.

Additional Example

A scientist is interested in counting the number of different species of bacteria in San Diego Creek. She takes a bucket of water from San Diego Creek where she happens to be standing and counts the different species of bacteria. The bacteria in the bucket make up the sample and the bacteria in San Diego Creek make up the population. The scientist is using the convenience sampling method.

Simple Random Sample

When the simple random sampling technique is used, every element of the population has an equal chance of being selected for the sample.

Example

The researcher can

reach out to the registrar to get student emails;
randomly select 100 students;
email them the survey.

Assume that the 100 selected students respond.

Population: All UCI students
Sample: 100 students who have responded

Simple Random Sampling in R

sample(1:100, 3, replace = FALSE)

This allows us to sample 3 numbers from 1 to 100 without replacement, meaning a number can only be selected once.

[1] 67 36 10

To generalize:

sample(x = 1:N, size = n, replace = FALSE)

This code will take a random sample of size \(n\) from the population consisting of the integers from 1 to \(N\).

Side note: This is not truly random but that’s beyond the scope of this class. Here is a fun (short) reading about it. Philosophers also discuss if true randomness exists or not.
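
Applied to the survey example, the general call can be used to pick which students to email. This is a minimal sketch, assuming a hypothetical roster of 5000 addresses; the roster size, the addresses, and the seed are made up for illustration and do not come from the registrar.

```r
# Hypothetical roster; in practice this list would come from the registrar
N <- 5000
roster <- paste0("student", 1:N, "@example.edu")

set.seed(1)  # only so that the same draw can be reproduced
chosen <- sample(x = 1:N, size = 100, replace = FALSE)
survey_emails <- roster[chosen]

length(survey_emails)         # 100
anyDuplicated(survey_emails)  # 0, because sampling is without replacement
```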

Non-response Bias

Even when simple random sampling is used, if selected participants are unwilling to take part in the study, the results can have non-response bias.

Example

It is unlikely that all 100 students will respond. Assume that 86 respond.

It is possible that those 14 who did not respond

may be busy exercising and did not have the time to respond.
may be busy studying and did not have the time to respond.

Additional Example

A social media company shows a survey to some of its users on the timeline. Many users ignore the survey and do not take it. There is a high non-response rate and thus the results cannot be generalized to the population.

Cluster Sampling

In cluster sampling the population is divided into groups (i.e., clusters). The sample consists of all elements in randomly selected clusters.

Example

The researchers may get a list of classes taught at UCI. They randomly select 10 classes. All the students in those 10 classes will be in the sample.
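
The class example could be sketched in R as follows. The enrollment table is hypothetical (40 classes of 30 students each), and it makes the simplifying assumption that each student appears in only one class.

```r
# Hypothetical enrollment table: 40 classes with 30 students each (made-up)
enrollment <- data.frame(
  class_id   = rep(1:40, each = 30),
  student_id = 1:1200
)

set.seed(1)
# Randomly select 10 clusters (classes), not individual students
picked_classes <- sample(unique(enrollment$class_id), size = 10, replace = FALSE)

# Every student in a selected class enters the sample
cluster_sample <- enrollment[enrollment$class_id %in% picked_classes, ]
nrow(cluster_sample)  # 300 = 10 classes x 30 students
```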

Stratified Sampling

In stratified sampling the population is first divided into groups (i.e., strata) and then the sample is selected randomly within each stratum.

Example

The researchers suspect that exercising patterns might be different across different class years. Thus they want to make sure that the sample includes first-years, sophomores, juniors, and seniors. They get a list of students with class year information from the registrar. They then randomly select 25 students who are first years, 25 sophomores, 25 juniors, and 25 seniors.
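
A minimal sketch of this stratified draw, assuming a hypothetical registrar list with 1000 students in each class year (the list, its size, and the seed are made up for illustration):

```r
# Hypothetical registrar list: 4000 students, 1000 per class year (made-up)
students <- data.frame(
  student_id = 1:4000,
  class_year = rep(c("first-year", "sophomore", "junior", "senior"), each = 1000)
)

set.seed(1)
# Split ids by stratum, then draw 25 ids at random within each stratum
ids_by_year <- split(students$student_id, students$class_year)
sampled_ids <- unlist(lapply(ids_by_year, function(ids) sample(ids, size = 25)))

stratified_sample <- students[students$student_id %in% sampled_ids, ]
table(stratified_sample$class_year)  # 25 in each class year
```

Unlike cluster sampling, every stratum contributes to the sample here; randomness is used inside each group rather than to choose which groups are included.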

Study Design

Anecdotal Evidence

Anecdotal evidence is an observation that is haphazard rather than systematic.

Example

We might meet a junior student who got 100 points in all UCI exams, homework assignments, and quizzes that they have taken and they say that they exercise regularly. Even though the data are factually correct (i.e., high GPA and regular exercise routine), this single observation does not tell us whether UCI students who exercise regularly tend to have higher GPA.

Anecdotal evidence is not a scientific method to answer research questions. We need rigorously designed studies to make generalizations and/or to establish causal relationships.

Observational Study

In observational studies, researchers study the research question without exposing the cases (or subset of a sample) to any treatment or intervention. In observational studies causal relationships between variables cannot be established.

Example

Based on the survey, even if we observe that UCI students who exercise regularly have higher GPA, we cannot conclude that exercising regularly increases GPA.

Relationship between two variables

If two variables are related to each other in some way we would call them associated.

If two variables are not related to each other in any way we would call them independent.

When we examine the relationship between two variables, we often want to know if the relationship between them is causal. In other words, does one variable cause the other? For instance, is exercising the reason for higher GPA? We don’t know!

When we suspect that two variables have a causal relationship we can say

The explanatory variable (e.g. exercising) might causally affect the response variable (e.g. GPA).

A relationship between two variables does not imply that one causes the other.

Explanatory variables are denoted by \(x\) and the response variable is denoted by \(y\). A way to remember this: the eXplanatory variable is \(x\). Exercising may eXplain high GPA.

A confounding variable (e.g. time management skills) is correlated with both the explanatory and the response variable.

Experiment Design

In experiments, researchers assign cases to treatments/interventions.

In randomized experiments, researchers randomly assign cases to treatments/interventions. In order to establish a causal link between variables, we need randomized experiments.

Example

~~Do UCI students who exercise regularly have higher GPA?~~

Does exercising regularly increase GPA for UCI students?

Image Copyright Derenik Haghverdian. Used with permission

Note

Random sampling and random assignment (i.e., random allocation) serve different purposes.

Random sampling

method of choosing sample from the population
the goal is to establish generalizability

Random allocation

method of assigning the sample to different treatment groups
the goal is to establish causality.
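
Random allocation can be sketched in R with the same sample() function used for random sampling. This sketch assumes a sample of 100 participants has already been drawn and uses two hypothetical groups, "exercise" and "control"; the group names, sizes, and seed are assumptions for illustration.

```r
# 100 hypothetical participants who are already in the sample
participants <- 1:100

set.seed(1)
# Shuffle a balanced vector of labels so that each group gets 50 participants
group <- sample(rep(c("exercise", "control"), each = 50))

assignment <- data.frame(participant = participants, group = group)
table(assignment$group)  # 50 control, 50 exercise
```

Here sample() is called with only one argument, which returns a random permutation of the labels. Nothing is being selected from a population at this step; the sample is only being divided into treatment groups.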

Blocking

A doctor has developed a drug called drug i.d.s. to treat some disease. She wants to know if patients who take drug i.d.s. are free of the disease for at least a year.

The doctor suspects that the drug may affect adults and kids differently.

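If researchers suspect that an additional variable may influence the response variable, they may use blocks. Cases are first grouped into blocks based on that variable and then randomly assigned to treatments within each block.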
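
A minimal sketch of blocked random assignment for the drug example, assuming 40 adults and 20 kids and a placebo as the comparison treatment (the counts, the column names, and the use of a placebo here are assumptions for illustration):

```r
# Hypothetical patients: 40 adults and 20 kids (made-up counts)
patients <- data.frame(
  id        = 1:60,
  age_group = rep(c("adult", "kid"), times = c(40, 20))
)

set.seed(1)
patients$treatment <- NA_character_
for (block in unique(patients$age_group)) {
  in_block <- patients$age_group == block
  # Balanced labels for this block, shuffled into a random order
  labels <- rep(c("drug", "placebo"), length.out = sum(in_block))
  patients$treatment[in_block] <- sample(labels)
}

table(patients$age_group, patients$treatment)
# adults: 20 drug, 20 placebo; kids: 10 drug, 10 placebo
```

Because the shuffle happens separately inside each block, both treatments are guaranteed to contain the same mix of adults and kids.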

Blocking

Image Copyright Federica Ricci. Used with permission.

A/B testing

A/B testing is a randomized experiment that compares two versions (A and B) of a single variable.
It is commonly used to measure online activities such as revenue per user, click-through rates for online ads, and number of returning users.
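
As a small illustration of what an A/B comparison computes, the sketch below compares click-through rates for two ad versions. All counts are made up purely for illustration and do not come from any real experiment.

```r
# Made-up counts for two versions of an ad shown to randomly assigned users
views  <- c(A = 1000, B = 1000)
clicks <- c(A = 30,   B = 45)

ctr <- clicks / views          # click-through rate per version
ctr
unname(ctr["B"] - ctr["A"])    # observed difference in click-through rate
```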

More Vocabulary about Experiments

A placebo is a fake treatment. If a patient shows an improvement after taking a placebo, this is called the placebo effect.

In blind studies, patients do not know what treatment they receive. In double-blind studies, neither the patients who receive the treatment nor the doctors who provide it know the type of the treatment.

Simpson’s Paradox

Simpson’s Paradox - UC Berkeley Admissions, 1973

Simpson’s Paradox

If we observe a certain trend between two variables and this trend disappears or reverses when a third variable is introduced, then this phenomenon is called Simpson’s paradox.
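
Base R ships with a dataset named UCBAdmissions, a three-way table of admission decisions by gender and department for the six largest UC Berkeley graduate departments in 1973. Assuming this is the same data as the example referenced above, the reversal can be checked with a short sketch:

```r
# UCBAdmissions is a 3-way table with dimensions Admit x Gender x Dept
applicants_by_gender <- apply(UCBAdmissions, 2, sum)
admitted_by_gender   <- apply(UCBAdmissions, c(1, 2), sum)["Admitted", ]

# Two variables only: admission rate by gender, ignoring department
overall_rate <- admitted_by_gender / applicants_by_gender

# Third variable introduced: admission rate by gender within each department
dept_rate <- UCBAdmissions["Admitted", , ] / apply(UCBAdmissions, c(2, 3), sum)

round(overall_rate, 2)  # men have the higher overall admission rate
round(dept_rate, 2)     # in department A, women have the higher rate
```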

More examples on Wikipedia

Moral of the Story

We need to move beyond thinking about the relationship between just two variables. We need to keep asking if there are/could be any confounding variables.

Writing Research Questions Using Notation

If there was no variance there would be no need for statistics.

What if?

We want to understand the average number of hours of sleep Irvine residents get. What if everyone in Irvine slept 8 hours every night? (sleep = {8, 8,…, 8})
We want to predict who will graduate college. What if everyone graduated college? (graduate = {TRUE, TRUE,…, TRUE})

We want to understand if Android users spend more time on their phones when compared to iOS users. What if everyone spent 3 hours per day on their phones? (time = {3, 3,…, 3}, os = {Android, Android,…, iOS})
We want to understand if birth height and weight are positively associated in babies. What if every baby was 7.5 lbs? (weight = {7.5, 7.5,…, 7.5}, height = {20, 22,…, 18})

In all these fake scenarios there would be no variance in sleep, graduate, time, or weight. Each of these would be a constant and thus would not even be a variable.
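
The first scenario can be checked directly in R. The vectors below are hypothetical; the names sleep_constant and sleep_varied are chosen only for this sketch.

```r
sleep_constant <- rep(8, 10)       # ten hypothetical residents, all sleeping 8 hours
sleep_varied   <- c(8, 7, 9, 7.5)  # hypothetical values that differ from each other

var(sleep_constant)  # 0
var(sleep_varied)    # greater than 0
```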

Things vary. We use statistics in research studies to understand how variables vary and often we want to know how they covary with other variables.

To make the connection between research questions of studies and statistics, we will take small steps and begin with writing research questions using notation.

Research Question Do UCI students sleep on average 8 hours on a typical night?

Variable sleep (8,7,9,7.5, …)

Research Question Using Notation \(\mu \stackrel{?}{=} 8\)

The parameter we want to infer about is a single mean.
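
The population mean \(\mu\) is unknown, but the mean of the observed values can be computed from the data. A minimal sketch using only the four values listed above (the remaining values are elided in the slide, and the names sleep_hours and x_bar are chosen for this sketch):

```r
sleep_hours <- c(8, 7, 9, 7.5)  # only the values shown above
x_bar <- mean(sleep_hours)      # sample mean
x_bar                           # 7.875
```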

Tip

If you want math notation to render correctly as \(\mu\) on Gradescope or in Quarto, you can write

$$\mu$$

The double dollar signs at the beginning and at the end let Gradescope know that you are writing a math equation.

Research Question Do the majority of Americans approve of allowing DACA immigrants to become citizens?

Variable approve (yes, yes, yes, no, yes, no, no)

Research Question Using Notation \(\pi \stackrel{?}{>} 0.5\)

The parameter we want to infer about is a single proportion.
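
For a binary variable, the proportion in the observed data is the share of values equal to the level of interest. A minimal sketch using the seven values listed above (the name p_hat is chosen for this sketch):

```r
approve <- c("yes", "yes", "yes", "no", "yes", "no", "no")
p_hat <- mean(approve == "yes")  # TRUE counts as 1, FALSE as 0
p_hat                            # 4/7, about 0.571
```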

Research Question Is California’s March 2020 unemployment rate different from the US March 2020 unemployment rate of 4.4%?

Variable unemployed_CA (no, no, yes, no, yes, no, no…)

Research Question Using Notation \(\pi \stackrel{?}{=} 0.044\)

The parameter we want to infer about is a single proportion.

Research Question Are there more STEM majors at UCI than non-STEM majors?

Variable STEM (TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE…)

Research Question Using Notation \(\pi_{STEM} \stackrel{?}{>} 0.5\)

The parameter we want to infer about is a single proportion.

RQ Do STEM (s) majors have higher or lower (different) income after graduation when compared to non-STEM (n) majors?

Variables explanatory: STEM (TRUE, FALSE, FALSE, TRUE,…)
response: income (40000, 20000, 65490, 115000,…)

Research Question Using Notation \(\mu_{s} \stackrel{?}{=} \mu_{n}\) or \(\mu_{s} - \mu_{n} \stackrel{?}{=}0\)

The parameter we want to infer about is a difference of two means.
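
With a binary explanatory variable, the observed counterpart of \(\mu_{s} - \mu_{n}\) is the difference between the two group means. A minimal sketch using only the four value pairs listed above (the remaining values are elided, and the names stem and diff_means are chosen for this sketch):

```r
stem   <- c(TRUE, FALSE, FALSE, TRUE)
income <- c(40000, 20000, 65490, 115000)

mean_stem     <- mean(income[stem])   # 77500
mean_non_stem <- mean(income[!stem])  # 42745
diff_means <- mean_stem - mean_non_stem
diff_means                            # 34755
```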

RQ Do Democrats and Republicans approve of legal abortion at the same rate?

Variables explanatory: party (D, D, R, R,…)
response: approve (TRUE, FALSE, FALSE, TRUE,…)

Research Question Using Notation \(\pi_{d} \stackrel{?}{=} \pi_{r}\) or \(\pi_{d} - \pi_{r} \stackrel{?}{=}0\)

The parameter we want to infer about is a difference of two proportions.

	Parameter of Interest	Response	Explanatory
Single Mean	\(\mu\)	Numeric
Difference of Two Means	\(\mu_1 - \mu_2\)	Numeric	Binary
Single Proportion	\(\pi\)	Binary
Difference of Two Proportions	\(\pi_1 - \pi_2\)	Binary	Binary

A categorical variable with two levels is called a binary variable.

Later on we will also learn

Parameter of Interest	Response	Explanatory
\(\beta_1\)	Numeric	Categorical and/or Numeric