Context is King

Shannon Pileggi

The setting

An email


Hi Shannon -

I see Travis is out on vacation. Can you re-run the flight delay report? Please walk us through the numbers at the next nycflights project meeting.

Thanks,

Sarah

Director of Flight Operations

Success 🎉


Characteristic    John F Kennedy Intl    La Guardia       Newark Liberty Intl
                  N = 111,279¹           N = 104,662¹     N = 120,835¹
delay_category
    Early         61,146 (56%)           63,129 (62%)     59,300 (50%)
    On time       6,239 (5.7%)           4,690 (4.6%)     5,585 (4.7%)
    Late          42,031 (38%)           33,690 (33%)     52,711 (45%)
    Unknown       1,863                  3,153            3,239
¹ n (%)


Walk through the numbers


56% of flights…

Characteristic    John F Kennedy Intl    La Guardia       Newark Liberty Intl
                  N = 111,279¹           N = 104,662¹     N = 120,835¹
delay_category
    Early         61,146 (56%)           63,129 (62%)     59,300 (50%)
    On time       6,239 (5.7%)           4,690 (4.6%)     5,585 (4.7%)
    Late          42,031 (38%)           33,690 (33%)     52,711 (45%)
    Unknown       1,863                  3,153            3,239
¹ n (%)

…departing from JFK departed early?

…arriving at JFK arrived early?

😬😱





A journey to understand source data & downstream variables

Data stewardship

The code


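library(dplyr)         # select(), left_join(), join_by(), mutate(), case_when()
library(forcats)       # fct_relevel()
library(nycflights13)  # flights, airports

flights_delay <- flights |>
  select(origin, dest, arr_delay, dep_delay) |>
  # attach the airport name that matches each origin code
  left_join(
    select(airports, faa, name),
    join_by(origin == faa)
  ) |>
  mutate(
    # bucket each flight by the sign of dep_delay
    delay_category = case_when(
      dep_delay < 0 ~ "Early",
      dep_delay == 0 ~ "On time",
      dep_delay > 0 ~ "Late"
    ) |> fct_relevel("Early", "On time", "Late")
  )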

nycflights13

Source data context

View(flights)

?flights

Source data context

View(flights)

External Excel file



Source data context can and should be embedded in your data

View(flights_labelled)

Assigning variable labels

flights_labelled <- flights |>
  labelled::set_variable_labels(
    year  = "Flight year of departure",
    month = "Flight month of departure",
    ...
  )


flights_labelled <- flights
attr(flights_labelled$year,  "label") <- "Flight year of departure"
attr(flights_labelled$month, "label") <- "Flight month of departure"
...
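A label is just an attribute stored on the column, which a small base R sketch can show. The data frame `toy_flights` and its values are made up for illustration; only `attr()` is used, so no packages are assumed.

```r
# Toy stand-in for flights; the values are illustrative only
toy_flights <- data.frame(
  year  = c(2013L, 2013L),
  month = c(1L, 2L)
)

# Assign labels as an attribute on each column
attr(toy_flights$year,  "label") <- "Flight year of departure"
attr(toy_flights$month, "label") <- "Flight month of departure"

# Read a label back
year_label <- attr(toy_flights$year, "label")
print(year_label)

# The data values themselves are unchanged by labelling
year_values <- as.vector(toy_flights$year)
print(year_values)
```

Because the label travels with the column, it is available wherever the data frame goes within the R session.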

Viewing labelled data

View(flights)

View(flights_labelled)

Viewing labelled data

str(flights_labelled)
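Beyond the viewer and `str()`, labels can also be pulled out programmatically. The sketch below collects the label of every column in base R and flags columns that have none. The helper name `get_labels` and the toy data are assumptions for illustration, not part of any package.

```r
# Toy data; dep_delay is deliberately left unlabelled
toy_flights <- data.frame(year = 2013L, month = 1L, dep_delay = 2)
attr(toy_flights$year,  "label") <- "Flight year of departure"
attr(toy_flights$month, "label") <- "Flight month of departure"

# Hypothetical helper: return each column's label, or NA when none was assigned
get_labels <- function(df) {
  vapply(
    df,
    function(col) {
      lbl <- attr(col, "label", exact = TRUE)
      if (is.null(lbl)) NA_character_ else as.character(lbl)
    },
    character(1)
  )
}

labels <- get_labels(toy_flights)
print(labels)

# Columns that still need a description
unlabelled <- names(labels)[is.na(labels)]
print(unlabelled)
```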

Identifying downstream context

flights_delay <- flights |>
  select(origin, dest, arr_delay, dep_delay) |>
  left_join(
    select(airports, faa, name),
    join_by(origin == faa)
    ) |>
  mutate(
    delay_category = case_when(
      dep_delay < 0 ~ "Early",
      dep_delay == 0 ~ "On time",
      dep_delay > 0 ~ "Late"
    ) |> fct_relevel("Early", "On time", "Late")
  )  

delay_category represents

Departure timing by origin airport
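The derivation uses only `dep_delay`, so `arr_delay` never enters the category. Rows with a missing `dep_delay` match none of the three conditions and stay missing. `gtsummary::tbl_summary()` reports missing values as "Unknown" by default, which is where that row in the table comes from. A minimal base R sketch of the same rules, using made-up delay values:

```r
# Made-up departure delays in minutes; NA stands for a missing record
dep_delay <- c(-5, 0, 12, NA)

# Same three rules as the case_when() call, written with ifelse()
delay_category <- ifelse(
  dep_delay < 0, "Early",
  ifelse(dep_delay == 0, "On time", "Late")
)
delay_category <- factor(
  delay_category,
  levels = c("Early", "On time", "Late")
)

# useNA = "ifany" makes the missing category visible in the counts
counts <- table(delay_category, useNA = "ifany")
print(counts)
```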

Downstream data context can and should be embedded in your data

flights_delay_labelled <- flights_labelled |>
  select(origin, dest, arr_delay, dep_delay) |>
  left_join(
    select(airports_labelled, faa, name),
    join_by(origin == faa)
    ) |>
  mutate(
    delay_category = case_when(
      dep_delay < 0 ~ "Early",
      dep_delay == 0 ~ "On time",
      dep_delay > 0 ~ "Late"
    ) |> fct_relevel("Early", "On time", "Late")
  )  |>
  labelled::set_variable_labels(
    delay_category = "Departure timing by origin airport",
    name = "Origin airport"
  )

Assigning variable labels encourages a disciplined practice of creating explicit and succinct variable descriptions, ensuring that data context lives with the data.


This helps:

  • current you, future you, & colleagues

  • peer review processes

  • creation of reusable data assets

Data stewardship: the practice of ensuring that data assets are accessible, secure, trustworthy, and usable.

Applications

Data dictionary

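library(purrr)   # map()
library(tibble)  # lst(), enframe()
library(tidyr)   # unnest()

# named list of the labelled data frames
flights_schema <- tibble::lst(
  airlines_labelled,
  airports_labelled,
  flights_labelled,
  planes_labelled,
  weather_labelled
)

# one row per variable, tagged with the name of its data frame
flights_dictionary <- flights_schema |>
  map(labelled::generate_dictionary) |>
  enframe() |>
  unnest(cols = value)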

View(flights_dictionary)
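To make the idea concrete without any packages, here is a simplified base R sketch of a dictionary built from a named list of labelled data frames. It is not what `labelled::generate_dictionary()` does internally, and it returns fewer columns. The toy tables, their labels, and the helper `describe_df` are all assumptions for illustration.

```r
# Two toy labelled data frames standing in for the nycflights13 tables
toy_airlines <- data.frame(carrier = "AA", name = "American Airlines Inc.")
attr(toy_airlines$carrier, "label") <- "Two letter carrier abbreviation"
attr(toy_airlines$name,    "label") <- "Full name of carrier"

toy_planes <- data.frame(tailnum = "N10156", seats = 55L)
attr(toy_planes$tailnum, "label") <- "Tail number"
attr(toy_planes$seats,   "label") <- "Number of seats"

toy_schema <- list(airlines = toy_airlines, planes = toy_planes)

# Hypothetical helper: one dictionary row per variable
describe_df <- function(df, df_name) {
  var_class <- vapply(df, function(col) class(col)[1], character(1))
  var_label <- vapply(
    df,
    function(col) {
      lbl <- attr(col, "label", exact = TRUE)
      if (is.null(lbl)) NA_character_ else as.character(lbl)
    },
    character(1)
  )
  data.frame(
    data_frame = df_name,
    variable   = names(df),
    class      = unname(var_class),
    label      = unname(var_label),
    stringsAsFactors = FALSE
  )
}

# Describe each data frame, then stack the results
toy_dictionary <- do.call(
  rbind,
  Map(describe_df, toy_schema, names(toy_schema))
)
rownames(toy_dictionary) <- NULL
print(toy_dictionary)
```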

Figures, unlabelled

flights_delay_labelled |>
  ggplot(aes(x = name, fill = delay_category)) +
  geom_bar()

Figures, labelled

flights_delay_labelled |>
  ggplot(aes(x = name, fill = delay_category)) +
  geom_bar() +
  ggeasy::easy_labs() 
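The idea of swapping column names for labels in a plot can be sketched in base R: look up the label for each mapped variable and fall back to the column name when none exists. This is an illustration of the concept only, not how `ggeasy` implements it. The helper `label_or_name` and the toy data are hypothetical.

```r
# Toy data; delay_category is left unlabelled to show the fallback
toy_delay <- data.frame(
  name = c("La Guardia", "Newark Liberty Intl"),
  delay_category = c("Early", "Late")
)
attr(toy_delay$name, "label") <- "Origin airport"

# Hypothetical helper: use the label when present, otherwise the column name
label_or_name <- function(df, var) {
  lbl <- attr(df[[var]], "label", exact = TRUE)
  if (is.null(lbl)) var else as.character(lbl)
}

x_title    <- label_or_name(toy_delay, "name")
fill_title <- label_or_name(toy_delay, "delay_category")
print(c(x = x_title, fill = fill_title))
```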

Tabling, unlabelled

flights_delay |>
  select(name, delay_category) |>
  gtsummary::tbl_summary(
    by = name
  ) 

Characteristic    John F Kennedy Intl    La Guardia       Newark Liberty Intl
                  N = 111,279¹           N = 104,662¹     N = 120,835¹
delay_category
    Early         61,146 (56%)           63,129 (62%)     59,300 (50%)
    On time       6,239 (5.7%)           4,690 (4.6%)     5,585 (4.7%)
    Late          42,031 (38%)           33,690 (33%)     52,711 (45%)
    Unknown       1,863                  3,153            3,239
¹ n (%)

Tabling, labelled

flights_delay_labelled |>
  select(name, delay_category) |>
  gtsummary::tbl_summary(
    by = name
  ) 

Characteristic                        John F Kennedy Intl    La Guardia       Newark Liberty Intl
                                      N = 111,279¹           N = 104,662¹     N = 120,835¹
Departure timing by origin airport
    Early                             61,146 (56%)           63,129 (62%)     59,300 (50%)
    On time                           6,239 (5.7%)           4,690 (4.6%)     5,585 (4.7%)
    Late                              42,031 (38%)           33,690 (33%)     52,711 (45%)
    Unknown                           1,863                  3,153            3,239
¹ n (%)

In practice

nycflights has 5 data frames & 53 variables

a clinical trial has 90 data frames & 1400 variables

               source    downstream
data frames    90        50
variables      1400      700

Our strategy - single data frame

  1. Maintain a csv with metadata.

  2. Apply custom function for bulk label assignment.

flights_delay_labelled <- flights |>
  select(...) |>
  left_join(...) |> 
  mutate(...) |> 
  croquet::set_derived_variable_labels(
    df_name = "flights_delay",
    path = "nycflights_variables.csv"
  )
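A bulk-labelling step of this kind can be sketched in base R as follows. This is not the implementation of `croquet::set_derived_variable_labels()`. The helper `set_labels_from_csv`, the csv column names (`df_name`, `variable`, `label`), and the toy rows are all assumptions made for illustration.

```r
# Hypothetical metadata file; the column layout is an assumption,
# not the layout of the author's csv
metadata_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    df_name  = c("flights_delay", "flights_delay", "airports"),
    variable = c("delay_category", "name", "faa"),
    label    = c(
      "Departure timing by origin airport",
      "Origin airport",
      "FAA airport code"
    )
  ),
  metadata_path,
  row.names = FALSE
)

# Hypothetical helper: label every column of `data` listed under `df_name`
set_labels_from_csv <- function(data, df_name, path) {
  metadata <- read.csv(path, stringsAsFactors = FALSE)
  metadata <- metadata[metadata$df_name == df_name, , drop = FALSE]
  for (i in seq_len(nrow(metadata))) {
    variable <- metadata$variable[i]
    # skip metadata rows for variables that are not in this data frame
    if (variable %in% names(data)) {
      attr(data[[variable]], "label") <- metadata$label[i]
    }
  }
  data
}

toy_delay <- data.frame(
  name = "La Guardia",
  delay_category = "Early",
  dep_delay = -3
)
toy_delay_labelled <- set_labels_from_csv(
  toy_delay,
  df_name = "flights_delay",
  path = metadata_path
)
print(attr(toy_delay_labelled$delay_category, "label"))
```

Keeping the descriptions in one csv means a label is written once and reused by every script that derives the same variable.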

Our strategy - list of data frames

  1. Maintain a csv with metadata.

  2. Apply custom function for bulk label assignment.

flights_schema_labelled <-
  purrr::imap(
    flights_schema_unlabelled,
    \(x, y) croquet::set_derived_variable_labels(
      data = x,
      df_name = y,
      path = "nycflights_variables.csv"
    )
  )
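The pattern of pairing each data frame with its list name can also be sketched with base R's `Map()`, which plays a role similar to `purrr::imap()` here. The helper `apply_labels`, the in-memory `metadata` table, and the toy schema are hypothetical stand-ins, not the author's function or files.

```r
# Hypothetical metadata, held in memory instead of a csv to keep the sketch short
metadata <- data.frame(
  df_name  = c("airlines", "planes"),
  variable = c("carrier", "seats"),
  label    = c("Two letter carrier abbreviation", "Number of seats")
)

# Hypothetical helper: label the columns of `data` listed under `df_name`
apply_labels <- function(data, df_name, metadata) {
  rows <- metadata[metadata$df_name == df_name, , drop = FALSE]
  for (i in seq_len(nrow(rows))) {
    variable <- rows$variable[i]
    if (variable %in% names(data)) {
      attr(data[[variable]], "label") <- rows$label[i]
    }
  }
  data
}

toy_schema_unlabelled <- list(
  airlines = data.frame(carrier = "AA"),
  planes   = data.frame(seats = 55L)
)

# x is the data frame, y is its name in the list
toy_schema_labelled <- Map(
  function(x, y) apply_labels(data = x, df_name = y, metadata = metadata),
  toy_schema_unlabelled,
  names(toy_schema_unlabelled)
)
print(attr(toy_schema_labelled$planes$seats, "label"))
```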

Wrap up

Summary

R

RStudio

R data

R, Python, Julia

RStudio, VS Code, Positron

R data, SAS, XPT, parquet, csv, JSON, DuckDB, PostgreSQL


Do you have sufficient metadata to facilitate reusable data assets?


Can you access and leverage the metadata in your programming environment?

Resources

Cheers to variable labels 🥂


Thank you to the many individuals who helped me develop this talk. Your support was invaluable.


Travis consented to the use of his name. 🤗