The Good, The Bad, and The Ugly Ways of Representing Data

Talk at Harvey Mudd College
mdogucu.github.io/harvey-mudd-25

Mine Dogucu
University of California Irvine

2025-03-29

Questions for the day


Q1. What is data representation?

Q2. How does one make good data visualizations? (and how does one avoid making bad or ugly data visualizations)

Q1. What is data representation?

An illustration done on old paper with circles, possibly representing stars, with some Chinese characters attached to them.

North Circumpolar Region from the Dunhuang Star Chart circa 649-684 CE.

Recommended reading

Funkhouser, H. G. (1937). Historical Development of the Graphical Representation of Statistical Data. Osiris, 3, 269–404. Chapter 2 is on The Origin of the Graphic Method.

The title of the plot reads "Change in kindergarten measles vaccination rates". On the x axis the values range from 80% to 100%. Each state has two values represented. For instance, for Idaho the pre-pandemic vaccination rate is around 89% but the vaccination rate in 2023-2024 is about 80%. There is a line labeled Idaho with an arrow showing the direction from 89% to 80%. Other states, some moving in the direction of increases, as well as the US average are visible.

There are spirals at the edge of a circle going inwards. The outside ring is labeled as 1875 and $21.86. Going from outer rings to inner rings, the labels are as follows: 1880 $488.532, 1885 $738.170, 1890 $1173.684, 1895 $1322.894, 1899 $1434.975. The length of the rings increases as the dollar amounts increase.

Assessed value of household and kitchen furniture owned by Black people in Georgia.

A non-square grid of wooden sticks with some material stuck to intersections.

20th century navigational chart from Kwajalein Atoll, Marshall Islands, Micronesia on display at Bowers Museum in Santa Ana. Photo by Mine Dogucu.

Visualization by Mona Chalabi.

nprscience · Supernova Sonification (Two)

Wanda Díaz-Merced is a Puerto Rican astronomer known for using sonification while studying stars. She is the director of the Arecibo Observatory.

The figure shows a layered wedding cake. The caption says "according to data from the city of Buenos Aires since 2010 the number of LGBTQ+ marriages per year has almost tripled". At the top of the cake there is a gay couple. The following years and numbers are associated with each layer of the cake, going from the top layer to the bottom layer: 2010 and 786, 2014 and 870, 2018 and 1038, 2022 and 1720.

Same-sex marriages in Buenos Aires City by Macarena Zappe

This is a table with the title "Excess mortality since region/country's first 50 covid deaths". Columns of the table include region/country, time period, covid-19 deaths (shown as a bar whose length corresponds to the number of deaths), total excess deaths (shown as a bar of a different color whose length corresponds to the number of deaths), and covid-19 as % of total. The bar lengths go from longest to shortest from top to bottom of the table.

COVID related deaths table by the Economist

A painting on canvas with mostly black colors and gray text. In the middle of the painting there are portraits of four little girls.

Howardena Pindell’s Four Little Girls

An illustration that shows each metro stop as a circle and each metro line in a different color.

LA Metro Rail map

Q1. What is data representation?

Data representation refers to the way we structure data to make it easier to understand the trends, patterns, and relationships found in the raw data.

Data visualization is one way of representing data. We have also seen data sonification and data in tactile format (e.g., the sticks).

Tables, maps, charts, plots, infographics are some ways of visualizing data.

The coordinate system, length, width, area, volume, color, and shape are ways we map data to visual elements.

Different tools (e.g., pen, paper, digital platforms, software, physical objects) can be used to represent data.

Data representations do not necessarily have to be made by a data scientist, but making them well requires an understanding of data science, the domain discipline, and art.

Data representations can be used for exploratory reasons and explanatory reasons.

Q2. How does one make good data visualizations?

Avoid deception

Truncated Axis

The title of the plot reads "If Bush tax cuts expire". It is a bar plot. The first bar is labeled "Now" at 35%, and the next bar is labeled "Jan 1, 2013" at 39.6%. The y axis starts at 34%.

Same as the previous graph except that the y axis goes from 0 to 40%, and thus the bars don't seem that different from one another.

Tip

Whenever possible start the axis at zero.
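
The effect of the truncated axis can be put into numbers. The following is a minimal sketch in Python (the talk itself does not include code, and the function name `drawn_ratio` is made up for illustration). It compares how much taller the second bar is drawn than the first bar when the axis starts at 34% versus at zero, using the two tax rates from the plot described above.

```python
def drawn_ratio(first, second, axis_start=0.0):
    """Ratio of the drawn bar lengths (second / first) when the axis starts at axis_start."""
    if axis_start >= min(first, second):
        raise ValueError("axis_start must be below both values")
    return (second - axis_start) / (first - axis_start)


# Values read from the bar plot described above (top tax rate, in percent).
now, jan_2013 = 35.0, 39.6

actual = jan_2013 / now                                   # how different the numbers really are
truncated = drawn_ratio(now, jan_2013, axis_start=34.0)   # axis starts at 34%
zero_based = drawn_ratio(now, jan_2013, axis_start=0.0)   # axis starts at 0%

print(f"ratio of the values themselves: {actual:.2f}")
print(f"ratio of bar lengths, axis from 34%: {truncated:.2f}")
print(f"ratio of bar lengths, axis from 0%: {zero_based:.2f}")
```

With the axis starting at 34%, the second bar is drawn more than five times as long as the first, even though the rate itself is only about 13% higher. Starting at zero makes the drawn ratio equal to the ratio of the values.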

Aspect ratio

Three plots with the same data. The x axis always has year and the y axis always has life expectancy. The plots are labeled as aspect ratio 1:2, aspect ratio 1:1, and aspect ratio 2:1. In the first plot the x axis is double the length of the y axis. In the third plot the y axis is double the length of the x axis. Thus in the first plot the trend can be perceived to have a low positive slope, whereas in the third plot the trend seems like a steeper positive change.
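
A small Python sketch of why the same data looks steeper or flatter. It assumes, for simplicity, a trend line that runs from the bottom-left corner of the panel to the top-right corner, and it reads the aspect ratio as height:width, as in the three panels described above. The function name `apparent_angle_deg` is hypothetical.

```python
import math


def apparent_angle_deg(height, width):
    """On-screen angle (degrees) of a line drawn corner to corner in a height x width panel."""
    return math.degrees(math.atan2(height, width))


# Aspect ratios written as height:width, matching the three panels.
panels = {"1:2": (1, 2), "1:1": (1, 1), "2:1": (2, 1)}

for label, (height, width) in panels.items():
    angle = apparent_angle_deg(height, width)
    print(f"aspect ratio {label}: trend drawn at about {angle:.1f} degrees")
```

The data never change, only the shape of the panel does, yet the drawn angle goes from under 30 degrees in the wide panel to over 60 degrees in the tall one.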

Choose colors with a purpose

Color for grouping

The title reads "Estimated share of children with blood lead levels at or above 5 micrograms per deciliter". Each country is shown as a circle on the plot, scattered along a y axis that goes from higher rates of elevated lead levels (around 100%) down to lower rates (zero percent). Each circle has a color that represents its region: Africa, Asia, Europe, Middle East, North America, Oceania, or South America.

Color for representing numeric values

The title of the plot reads "Local news is now an endangered species in much of the United States". The plot shows a county-level US map. Each county is colored according to a legend ranging from none (shown in red) to 10+ news outlets (shown in green).

Color for emphasis

The title of the plot reads "Warmth in the Gulf of Mexico". On the x axis we see months, and on the y axis we see values ranging from 0 to 80 kJ/cm^2. We are also provided the text "This chart shows a measure of ocean heat content expressed by kilojoules per square centimeter". There are many gray curves, each representing an individual year, and a dotted curve showing the 2012-2023 average. These curves seem to peak between Aug and Oct. There is one curve that is red and has a specific point labeled as Oct 7, 2024. This curve sits above the dotted curve.

Color Theory

The Hue bar (top) shows the full range of color hues mapped to degrees from 0° to 360°, wrapping around the color wheel—starting at red, through yellow, green, blue, magenta, and back to red. The Saturation bar (middle) shows how "intense" or "pure" the color is, going from 0% (completely desaturated, i.e., grayscale) to 100% (fully saturated, pure color). The Lightness/Brightness bar (bottom) shows how light or dark the color is, from 0% (black) to 100% (white), with the pure color appearing in the middle when lightness is 50%.
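
The three bars can be explored with Python's standard `colorsys` module. This is a minimal sketch, not part of the original slides; `hsl_to_hex` is a hypothetical helper that takes hue in degrees and saturation and lightness in percent, the same units used in the description above.

```python
import colorsys


def hsl_to_hex(hue_deg, saturation_pct, lightness_pct):
    """Convert hue (degrees), saturation (%), and lightness (%) to a hex color string."""
    # colorsys works with fractions in [0, 1] and takes the arguments
    # in the order hue, lightness, saturation (HLS, not HSL).
    red, green, blue = colorsys.hls_to_rgb(
        (hue_deg % 360) / 360,
        lightness_pct / 100,
        saturation_pct / 100,
    )
    return "#{:02x}{:02x}{:02x}".format(
        *(round(channel * 255) for channel in (red, green, blue))
    )


# Walk along the hue bar at full saturation and 50% lightness.
for hue in (0, 120, 240, 360):
    print(hue, hsl_to_hex(hue, 100, 50))

# Ends of the lightness bar: black at 0%, white at 100%.
print(hsl_to_hex(0, 100, 0), hsl_to_hex(0, 100, 100))

# Zero saturation gives a gray regardless of hue.
print(hsl_to_hex(200, 0, 50))
```

Hue 0 and hue 360 produce the same red, which is the wrap-around of the color wheel mentioned above.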

How to Pick a Color Palette

Fonts matter

Fonts matter for clarity

Food packaging that reads key lime tarts, but the font makes each letter t in the word tarts look like an f instead.

Fonts matter for the message

Two Post-it notes, both of which say please be mine. The left note is written in a curvy, almost cursive font. The right note is written in a font that looks like blood is dripping.



Comparison of four numeric styles Tabular Lining, Proportional Lining, Tabular Oldstyle, and Proportional Oldstyle. Each style displays the number '1984'. Tabular styles align numbers to equal widths; proportional styles use variable widths. Lining styles have uniform height; oldstyle styles use varying heights with some digits extending above or below the baseline.

Tip

Use lining and tabular fonts for numbers.




An example

Many design decisions go into making a data visualization. The following example is from one of my favorite data visualization experts, Cara Thompson, shared under a CC-BY license.

Data context

Table showing a study on odontoblast length (cells responsible for tooth growth) based on type and dose of supplement. The table compares Ascorbic Acid (Vitamin C) and Orange Juice at three doses 0.5 mg, 1 mg, and 2 mg. Each cell in the table contains rows of guinea pig face icons representing individual subjects in each condition.

This is a bar plot with the x axis labeled 0.5, 1, and 2, one value per bar. Within each bar we see two colors, red and blue. In the legend the supp variable is defined with red as OJ and blue as VC. The y axis shows mean-length.

This plot is a dodged bar plot where the OJ and VC supplements are shown next to each other as separate bars.

The bars get a white outline.

The legend text changes to Supplement, Orange Juice, and Vitamin C.

The gray background is replaced with a white one.

The x axis is now labeled as categorical_dose and there is no value of 1.5, which initially created a gap between the dose 1 and dose 2 bars.

Orange juice and vitamin C are separated into two facets, with orange juice on top as a separate bar plot.

There is a title that reads "In smaller doses, Orange Juice was associated with greater mean tooth growth, compared to equivalent doses of Vitamin C" and a subtitle that reads "With the highest dose, the mean recorded length was almost identical."

Vitamin C bars are now shown in a reddish orange color and orange juice bars in a yellowish orange color.

Dose is introduced to the legend, with low to high dose ranging from light to dark. This change is reflected in the colors of the bars too.

The legend is removed.

The x and y axes are flipped.

The title is bolded and the fonts have changed.

White space is added between the subtitle and the plots.

The y-axis label is removed.

The line spacing between the two lines of the title is reduced.

The words orange juice and vitamin C in the title match the corresponding colors of the bars.

The toothgrowth figure with the initial software defaults

The toothgrowth figure after all the desired changes

Tip

Do not rely on software defaults for font size, font type, colors, labels, text alignment, legend, etc. without intention.

Write alternate text

Screen reader example

The video shows use of a screen reader briefly.

Alternate Text

  • “Alt text” describes contents of an image.
  • Screen-readers cannot read images but can read alt text.
  • Alt text has to be provided.

Manual Alternate Text

  • Chart type

  • Type of data

  • Reason for including the chart

  • Link to data or source (not in alt text but in main text)

Cesal, 2020

  • Description conveys meaning in the data

  • Variables included on the axes

  • Scale described within the description

  • Type of plot is described

Canelón & Hare, 2021
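
The three components attributed to Cesal above (chart type, type of data, reason for including the chart) can be assembled into a sentence of the form "chart type of type of data where reason". The sketch below is one possible way to do that in Python. The helper `build_alt_text` is hypothetical, and the example wording is adapted from the tooth growth figure title earlier in the talk.

```python
def build_alt_text(chart_type, data_type, reason):
    """Assemble alt text as '<chart type> of <type of data> where <reason>.'"""
    fields = {"chart_type": chart_type, "data_type": data_type, "reason": reason}
    # Trim whitespace and trailing periods so the sentence ends with exactly one period.
    cleaned = {name: value.strip().rstrip(".") for name, value in fields.items()}
    missing = [name for name, value in cleaned.items() if not value]
    if missing:
        raise ValueError("missing alt text component(s): " + ", ".join(missing))
    return "{chart_type} of {data_type} where {reason}.".format(**cleaned)


example = build_alt_text(
    chart_type="Bar chart",
    data_type="mean tooth length by supplement and dose",
    reason="orange juice is associated with greater mean growth than vitamin C at smaller doses.",
)
print(example)
```

A template like this only guarantees that no component is forgotten. The description still has to be written by someone who understands what the chart is meant to convey, and the link to the data or source belongs in the main text rather than in the alt text.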

Consider ethical implications

Ethics of data visualization

In the Bayes Rules! book, the authors raise the following questions about model fairness. We can extend these to data visualizations.

How was the data collected? By whom and for what purpose? How might the results of the analysis, or the data collection itself, impact individuals and society? What biases or power structures might be baked into this analysis?

New Yorker style cartoon where a man is looking at a cat next to a litter box. The text reads "Never, ever, think outside the box"

Since you are not a cat, for Datathon think outside the box PLEASE

QUESTIONS?


Slides at mdogucu.github.io/harvey-mudd-25.

Source code for slides at github.com/mdogucu/harvey-mudd-25.


minedogucu.com
mdogucu
mastodon.social/@MineDogucu
bsky.app/profile/minedogucu.com
minedogucu