```
# A tibble: 2 × 5
  term        estimate std.error conf.low conf.high
  <chr>          <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   75.2      24.6      43.7    106.
2 age            0.268     0.446    -0.302    0.842
```
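This posterior summary could have come from a complete pooled fit along the following lines (a minimal sketch; the chain settings, seed, and reliance on the default weakly informative priors are assumptions, not necessarily what produced the output above):

```r
library(rstanarm)
library(broom.mixed)

# Complete pooling: one regression of net time on age,
# lumping all runners together
complete_pooled <- stan_glm(
  net ~ age, data = running,
  family = gaussian,
  chains = 4, iter = 5000*2, seed = 84735
)

# Posterior medians with 80% credible intervals
tidy(complete_pooled, conf.int = TRUE, conf.level = 0.80)
```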
```r
# Plot of the posterior median model
ggplot(running, aes(x = age, y = net, group = runner)) +
  geom_smooth(method = "lm", se = FALSE, color = "gray", size = 0.5) +
  geom_abline(aes(intercept = 75.2, slope = 0.268), color = "blue")
```
```r
# Select an example subset
examples <- running %>%
  filter(runner %in% c("1", "20", "22"))

ggplot(examples, aes(x = age, y = net)) +
  geom_point() +
  facet_wrap(~ runner) +
  geom_abline(aes(intercept = 75.2242, slope = 0.2678), color = "blue")
```
Though the observations on one runner might be independent of those on another, the observations within a runner are correlated. That is, how fast a runner ran in their previous race tells us something about how fast they’ll run in the next.
With respect to the relationship between running time and age, people are inherently different.
Framework of a complete pooled model, which lumps all observations together and ignores the grouping by runner:

\[Y_i | \beta_0, \beta_1, \sigma \sim N(\mu_i, \sigma^2) \quad \text{with} \quad \mu_i = \beta_0 + \beta_1 X_i\]

along with priors on \(\beta_0\), \(\beta_1\), and \(\sigma\).
Drawbacks of a complete pooling approach:
we violate the assumption of independence; and, in turn,
we might produce misleading conclusions about the relationship itself and the significance of this relationship.
No pooling
The no pooling approach builds a separate model for each runner.
Let \((Y_{ij}, X_{ij})\) denote the observed run time and age for runner \(j\) in their \(i\)th race. Then the data structure for the Normal linear regression model of run time vs age for runner \(j\) is:

\[Y_{ij} | \beta_{0j}, \beta_{1j}, \sigma \sim N(\mu_{ij}, \sigma^2) \quad \text{with} \quad \mu_{ij} = \beta_{0j} + \beta_{1j} X_{ij}\]
This model allows for each runner \(j\) to have a unique intercept \(\beta_{0j}\) and age coefficient \(\beta_{1j}\).
In the context of running, the no pooled models reflect the fact that some people tend to be faster than others (hence the different \(\beta_{0j}\)) and that changes in speed over time aren’t the same for everyone (hence the different \(\beta_{1j}\)).
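To make this concrete, a quick frequentist stand-in for the no pooling approach fits a separate least-squares line to each runner (a sketch; in a fully Bayesian analysis, each runner would get their own stan_glm fit):

```r
library(dplyr)

# No pooling: a separate fit of net time vs age for each runner
running %>%
  group_by(runner) %>%
  group_modify(~ broom::tidy(lm(net ~ age, data = .x)))
```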
Based on this model, what do you anticipate that your running time will be at the age of 55?
Drawbacks of a no pooling approach:
We cannot reliably generalize or apply the group-specific no pooled models to groups outside those in our sample.
No pooled models assume that one group doesn’t contain relevant information about another, and thus ignore potentially valuable information. This is especially consequential when we have a small number of observations per group.
Partial Pooling
Examples: Students within classrooms, patients within hospitals, different runs for each runner (longitudinal, repeated-measures)
Within-group variability
The degree of the variability among multiple observations within each group can be interesting on its own. For example, we can examine how consistent an individual’s running times are from year to year.
Between-group variability
Hierarchical data also allows us to examine the variability from group to group. For example, we can examine the degree to which running patterns vary from individual to individual.
```
# A tibble: 4 × 3
  artist        count popularity
  <fct>         <int>      <dbl>
1 Mia X             4       13.2
2 Chris Goldarg    10       16.4
3 Lil Skies         3       79.3
4 Camilo            9       81
```
Complete pooling
Ignore artists and lump all songs together
No pooling
Separately analyze each artist and assume that one artist’s data doesn’t contain valuable information about another artist
Partial pooling (via hierarchical models)
Acknowledge the grouping structure, so that even though artists differ in popularity, they might share valuable information about each other and about the broader population of artists.
The hierarchy
Layer 1:
\(Y_{ij} | \mu_j, \sigma_y \sim \text{model of how song popularity varies WITHIN artist } j\)
Layer 2:
\(\mu_j | \mu, \sigma_\mu \sim \text{model of how the typical popularity } \mu_j \text{ varies BETWEEN artists}\)
Layer 3:
\(\mu, \sigma_y, \sigma_\mu \sim \text{prior models for shared global parameters}\)
\(\mu_j\) = mean song popularity for artist \(j\); and
\(\sigma_y\) = within-group variability, i.e., the standard deviation in popularity from song to song within each artist.
Popularity varies from artist to artist. We model this variability in mean popularity between artists by assuming that the individual mean popularity levels, \(\mu_j\), are Normally distributed around \(\mu\) with standard deviation \(\sigma_\mu\).
\(\mu\) = the global average of mean song popularity \(\mu_j\) across all artists \(j\), i.e., the mean popularity rating for the most average artist; and
\(\sigma_\mu\) = between-group variability, i.e., the standard deviation in mean popularity \(\mu_j\) from artist to artist.
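Putting the layers together in concrete Normal form (the Layer 3 hyperparameters \(m\), \(s\), \(l_y\), \(l_\mu\) are placeholders to be specified by the analyst):

\[\begin{array}{rl}
Y_{ij} | \mu_j, \sigma_y & \sim N(\mu_j, \sigma_y^2) \\
\mu_j | \mu, \sigma_\mu & \sim N(\mu, \sigma_\mu^2) \\
\mu & \sim N(m, s^2) \\
\sigma_y & \sim \text{Exp}(l_y) \\
\sigma_\mu & \sim \text{Exp}(l_\mu) \\
\end{array}\]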
There’s a difference between \(\mu_j\) and \(\mu\). When a parameter has a subscript \(j\), it refers to a feature of group \(j\). When a parameter doesn’t have a subscript \(j\), it’s the global counterpart, i.e., the same feature across all groups.
Subscripts signal the group or layer of interest. For example, \(\sigma_y\) refers to the standard deviation of \(Y\) values within each group, whereas \(\sigma_\mu\) refers to the standard deviation of means \(\mu_j\) from group to group.
To indicate that the artist variable defines the group structure of our data, as opposed to it being a predictor of popularity, the appropriate formula here is popularity ~ (1 | artist).
The prior for \(\sigma_\mu\) is specified by prior_covariance. For this particular model, with only one set of artist-specific parameters \(\mu_j\), this is equivalent to an Exp(1) prior. (We will learn more about prior_covariance next lecture).
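A call along these lines could have produced the spotify_hierarchical object analyzed below (a sketch; the specific prior settings, chain lengths, and seed are assumptions):

```r
library(rstanarm)

spotify_hierarchical <- stan_glmer(
  popularity ~ (1 | artist), data = spotify,
  family = gaussian,
  prior_intercept = normal(50, 2.5, autoscale = TRUE),
  prior_aux = exponential(1, autoscale = TRUE),
  prior_covariance = decov(regularization = 1, concentration = 1,
                           shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735
)
```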
```r
# Store the simulation in a data frame
spotify_hierarchical_df <- as.data.frame(spotify_hierarchical)

# Check out the first 3 and last 3 parameter labels
spotify_hierarchical_df %>%
  colnames() %>%
  as.data.frame() %>%
  slice(1:3, 45:47)
```
\(\sigma_\mu^2\) = Sigma[artist:(Intercept),(Intercept)]. This is not a typo: the default output gives us information about the standard deviation within artists (\(\sigma_y\)) but the variance between artists (\(\sigma_\mu^2\)).
Pay attention to effects = fixed, where “fixed” is synonymous with “non-varying” or “global.”
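For example, the posterior for the global mean \(\mu\) can be summarized with a call like this (a sketch of the kind of call that would produce the interval quoted next):

```r
tidy(spotify_hierarchical, effects = "fixed",
     conf.int = TRUE, conf.level = 0.80)
```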
Per the results, there’s an 80% chance that the average artist has a mean popularity rating between 49.2 and 55.6.
To call up the posterior medians for \(\sigma_y\) and \(\sigma_\mu\), we can specify effects = "ran_pars", i.e., parameters related to randomness or variability:
```r
tidy(spotify_hierarchical, effects = "ran_pars")
```
```
# A tibble: 2 × 3
  term                    group    estimate
  <chr>                   <chr>       <dbl>
1 sd_(Intercept).artist   artist       15.2
2 sd_Observation.Residual Residual     14.0
```
The posterior median of \(\sigma_y\) (sd_Observation.Residual) suggests that, within any given artist, popularity ratings tend to vary by 14 points from song to song. The between-group standard deviation \(\sigma_\mu\) (sd_(Intercept).artist) tends to be slightly higher at around 15.2. Thus, the mean popularity rating tends to vary by 15.2 points from artist to artist.
The proportion of \(\text{Var}(Y_{ij})\) that can be explained by differences in the observations within each group:

\[\frac{\sigma^2_y}{\sigma^2_\mu + \sigma^2_y}\]

The proportion of \(\text{Var}(Y_{ij})\) that can be explained by differences between groups:

\[\frac{\sigma^2_\mu}{\sigma^2_\mu + \sigma^2_y}\]
These two sources of variability suggest that the popularity levels among multiple songs by the same artist tend to have a moderate correlation near 0.54.
```r
15.1^2 / (15.1^2 + 14.0^2)
```

```
[1] 0.5377468
```
Thinking of this another way, 54% of the variability in song popularity is explained by differences between artists, whereas 46% is explained by differences among the songs within each artist:
```r
14.0^2 / (15.1^2 + 14.0^2)
```

```
[1] 0.4622532
```
Posterior analysis of group-specific parameters
\[\mu_j = \mu + b_j \]
Here, \(b_j\) describes the difference between artist \(j\)’s mean popularity and the global mean popularity.
```r
artist_summary <- tidy(spotify_hierarchical, effects = "ran_vals",
                       conf.int = TRUE, conf.level = 0.80)

# Check out the results for the first & last 2 artists
artist_summary %>%
  select(level, conf.low, conf.high) %>%
  slice(1:2, 43:44)
```
```r
# Get MCMC chains for each mu_j
artist_chains <- spotify_hierarchical %>%
  tidybayes::spread_draws(`(Intercept)`, b[,artist]) %>%
  mutate(mu_j = `(Intercept)` + b)

# Check it out
artist_chains %>%
  select(artist, `(Intercept)`, b, mu_j) %>%
  head(4)
```
```
# A tibble: 4 × 4
# Groups:   artist [4]
  artist              `(Intercept)`      b  mu_j
  <chr>                       <dbl>  <dbl> <dbl>
1 artist:Alok                  50.1  11.4   61.5
2 artist:Atlas_Genius          50.1  -9.81  40.2
3 artist:Au/Ra                 50.1  12.3   62.4
4 artist:Beyoncé               50.1  22.2   72.3
```
```r
# Get posterior summaries for mu_j
artist_summary_scaled <- artist_chains %>%
  select(-`(Intercept)`, -b) %>%
  tidybayes::mean_qi(.width = 0.80) %>%
  mutate(artist = fct_reorder(artist, mu_j))

# Check out the results
artist_summary_scaled %>%
  select(artist, mu_j, .lower, .upper) %>%
  head(4)
```
For context, compare the number of observed songs behind two of these artists:

```
# A tibble: 2 × 3
  artist      count popularity
  <fct>       <int>      <dbl>
1 Frank Ocean    40       69.8
2 Lil Skies       3       79.3
```
Our posterior understanding of Frank Ocean is based on 40 songs, the most of any artist in the dataset, whereas we have only 3 songs for Lil Skies. We naturally have greater posterior certainty about Frank Ocean’s popularity, and hence narrower intervals.
Posterior Prediction
First consider posterior prediction for an observed group, Frank Ocean, the \(j = 39\)th artist in our sample.
The first layer of our hierarchical model holds the key in this situation: it assumes that the popularity levels of individual Frank Ocean songs are Normally distributed around his own mean popularity level \(\mu_j\) with standard deviation \(\sigma_y\).
Thus, to approximate the posterior predictive model for the popularity of Ocean’s next song on Spotify, we can simulate a prediction from the Layer 1 model evaluated at each of the 20,000 MCMC parameter sets \(\left\lbrace \mu_j^{(i)}, \sigma_y^{(i)}\right\rbrace\):

\[Y_{\text{new},j}^{(i)} \sim N\left(\mu_j^{(i)}, \left(\sigma_y^{(i)}\right)^2\right)\]
The resulting predictions \(\left\lbrace Y_{\text{new},j}^{(1)}, Y_{\text{new},j}^{(2)}, \ldots, Y_{\text{new},j}^{(20000)} \right\rbrace\) and corresponding posterior predictive model will reflect two sources of variability, and hence uncertainty, in the popularity of Ocean’s next song:
within-group sampling variability in \(Y\), i.e., not all of Ocean’s songs are equally popular; and
posterior variability in the model parameters \(\mu_j\) and \(\sigma_y\), i.e., the underlying mean and variability in popularity across Ocean’s songs are unknown and can, themselves, vary.
```r
# Posterior summary of mu_j
artist_summary_scaled %>%
  filter(artist == "artist:Frank_Ocean")
```
```
# A tibble: 1 × 7
  artist              mu_j .lower .upper .width .point .interval
  <fct>              <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>
1 artist:Frank_Ocean  69.4   66.5   72.2    0.8 mean   qi
```
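One way to carry out this simulation by hand is to draw from the Layer 1 Normal model at each MCMC iteration (a sketch; the b[...] column name follows the default rstanarm labeling and should be checked against colnames(spotify_hierarchical_df)):

```r
set.seed(84735)

# Simulate Ocean's posterior predictive model from the Layer 1 model
ocean_chains <- spotify_hierarchical_df %>%
  rename(b = `b[(Intercept) artist:Frank_Ocean]`) %>%
  select(`(Intercept)`, b, sigma) %>%
  mutate(mu_ocean = `(Intercept)` + b,
         y_ocean = rnorm(n(), mean = mu_ocean, sd = sigma))

# 80% posterior prediction interval for Ocean's next song
ocean_chains %>%
  summarize(lower = quantile(y_ocean, 0.10),
            upper = quantile(y_ocean, 0.90))
```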
Next consider posterior prediction for a yet unobserved group, Mohsen Beats.
No observed songs for Mohsen Beats means that we do not have any information about their mean popularity \(\mu_j\), and thus can’t take the same approach as we did for Ocean.
What we do know is this: (1) Mohsen Beats is an artist within the broader population of artists, (2) mean popularity levels among these artists are Normally distributed around some global mean \(\mu\) with between-artist standard deviation \(\sigma_\mu\) (Layer 2), and (3) our 44 sampled artists have informed our posterior understanding of this broader population. Then to approximate the posterior predictive model for the popularity of Mohsen Beats’ next song, we can simulate 20,000 predictions \(\left\lbrace Y_{\text{new},\text{mohsen}}^{(1)}, Y_{\text{new},\text{mohsen}}^{(2)}, \ldots, Y_{\text{new},\text{mohsen}}^{(20000)} \right\rbrace\) through a two-step process:
Step 1: Simulate a potential mean popularity level \(\mu_{\text{mohsen}}\) for Mohsen Beats by drawing from the Layer 2 model evaluated at each MCMC parameter set \(\left\lbrace \mu^{(i)}, \sigma_\mu^{(i)}\right\rbrace\):

\[\mu_{\text{mohsen}}^{(i)} \sim N\left(\mu^{(i)}, \left(\sigma_\mu^{(i)}\right)^2\right)\]
Step 2: Simulate a prediction of song popularity \(Y_{\text{new},\text{mohsen}}\) from the Layer 1 model evaluated at each MCMC parameter set \(\left\lbrace \mu_{\text{mohsen}}^{(i)}, \sigma_y^{(i)}\right\rbrace\):

\[Y_{\text{new},\text{mohsen}}^{(i)} \sim N\left(\mu_{\text{mohsen}}^{(i)}, \left(\sigma_y^{(i)}\right)^2\right)\]
The additional step in our Mohsen Beats posterior prediction process reflects a third source of variability. When predicting song popularity for a new group, we must account for:
within-group sampling variability in \(Y\), i.e., not all of Mohsen Beats’ songs are equally popular;
between-group sampling variability in \(\mu_j\), i.e., not all artists are equally popular; and
posterior variability in the global model parameters \((\sigma_y, \mu, \sigma_\mu)\).
We’re able to predict with 80% posterior certainty that their next song will have a popularity rating somewhere between 25.66 and 78.6.
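A sketch of this two-step simulation, built from the chains of the global parameters (the Sigma[...] column name follows the default rstanarm labeling):

```r
set.seed(84735)

mohsen_chains <- spotify_hierarchical_df %>%
  mutate(sigma_mu = sqrt(`Sigma[artist:(Intercept),(Intercept)]`),
         # Step 1: draw a plausible mean popularity level for Mohsen Beats
         mu_mohsen = rnorm(n(), mean = `(Intercept)`, sd = sigma_mu),
         # Step 2: draw a song popularity around that mean
         y_mohsen = rnorm(n(), mean = mu_mohsen, sd = sigma))

# 80% posterior prediction interval for Mohsen Beats' next song
mohsen_chains %>%
  summarize(lower = quantile(y_mohsen, 0.10),
            upper = quantile(y_mohsen, 0.90))
```

The same predictions can also be obtained directly with posterior_predict(spotify_hierarchical, newdata = data.frame(artist = "Mohsen Beats")), which simulates new artist-level parameters for unobserved groups automatically.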
Shrinkage refers to the phenomenon in which the group-specific local trends in a hierarchical model are pulled or shrunk toward the global trends.
When utilizing weakly informative priors, the posterior mean predictions of song popularity from the hierarchical model are (roughly) weighted averages of those from the complete pooled ( \(\overline{y}_{\text{global}}\) ) and no pooled ( \(\overline{y}_j\) ) models:

\[\mu_j \approx \frac{\frac{n_j}{\sigma_y^2}}{\frac{n_j}{\sigma_y^2} + \frac{1}{\sigma_\mu^2}} \overline{y}_j + \frac{\frac{1}{\sigma_\mu^2}}{\frac{n_j}{\sigma_y^2} + \frac{1}{\sigma_\mu^2}} \overline{y}_{\text{global}}\]
In posterior predictions for artist \(j\), the weights given to the global and local means depend upon how much data we have on artist \(j\) ( \(n_j\) ) as well as the comparison of the within-group and between-group variability in song popularity ( \(\sigma_y\) and \(\sigma_\mu\) ). These weights highlight a couple of scenarios in which individualism fades, i.e., our hierarchical posterior predictions shrink away from the group-specific means \(\overline{y}_j\) and toward the global mean \(\overline{y}_\text{global}\):
Shrinkage increases as the number of observations on group \(j\), \(n_j\), decreases. That is, we rely more and more on global trends to understand a group for which we have little data.
Shrinkage increases when the variability within groups, \(\sigma_y\), is large in comparison to the variability between groups, \(\sigma_\mu\). That is, we rely more and more on global trends to understand a group when there is little distinction in the patterns from one group to the next.
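To see both scenarios numerically, consider a small hypothetical helper that computes the weight placed on the group-specific mean \(\overline{y}_j\), plugging in the posterior medians of \(\sigma_y\) and \(\sigma_\mu\) from above (shrinkage_weight is an illustrative name, not part of any package):

```r
# Hypothetical helper: the weight on the artist-specific mean ybar_j
shrinkage_weight <- function(n_j, sigma_y = 14.0, sigma_mu = 15.2) {
  (n_j / sigma_y^2) / (n_j / sigma_y^2 + 1 / sigma_mu^2)
}

shrinkage_weight(40)  # Frank Ocean, 40 songs: weight near 1, little shrinkage
shrinkage_weight(3)   # Lil Skies, 3 songs: smaller weight, more shrinkage
```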