Hi Shannon -
I see Travis is out on vacation. Can you re-run the flight delay report? Please walk us through the numbers at the next nycflights project meeting.
Thanks,
Sarah
Director of Flight Operations
Characteristic | John F Kennedy Intl, N = 111,279¹ | La Guardia, N = 104,662¹ | Newark Liberty Intl, N = 120,835¹ |
---|---|---|---|
delay_category | | | |
Early | 61,146 (56%) | 63,129 (62%) | 59,300 (50%) |
On time | 6,239 (5.7%) | 4,690 (4.6%) | 5,585 (4.7%) |
Late | 42,031 (38%) | 33,690 (33%) | 52,711 (45%) |
Unknown | 1,863 | 3,153 | 3,239 |
¹ n (%)
56% of flights…
departing from JFK
departed early
arriving at JFK
arrived early
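To make the ambiguity concrete, here is a minimal base R sketch on made-up numbers (the `toy` data frame is hypothetical, not taken from nycflights13). The same phrase, "percent early", gives different answers depending on whether it is computed from `dep_delay` or `arr_delay`. Missing delays are dropped from the denominator, as the "Unknown" row is in the table.

```r
# Made-up delays in minutes, for illustration only.
# Negative values mean early; NA means the delay is unknown.
toy <- data.frame(
  dep_delay = c(-5, -2, 0, 10, 25, NA),
  arr_delay = c(-12, 4, -3, -1, 30, NA)
)

# Share of flights that departed early, excluding unknowns.
pct_departed_early <- mean(toy$dep_delay < 0, na.rm = TRUE)

# Share of flights that arrived early, excluding unknowns.
pct_arrived_early <- mean(toy$arr_delay < 0, na.rm = TRUE)

pct_departed_early
pct_arrived_early
```

Without a description attached to `delay_category`, a reader of the table cannot tell which of these two quantities was reported.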
😬😱
A journey to understand source data & downstream variables
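library(tidyverse)
library(nycflights13)

# Keep the delay columns, add the origin airport name, and
# classify each flight by its departure delay.
flights_delay <- flights |>
  select(origin, dest, arr_delay, dep_delay) |>
  left_join(
    select(airports, faa, name),
    join_by(origin == faa)
  ) |>
  mutate(
    # A missing dep_delay matches no condition and stays NA.
    delay_category = case_when(
      dep_delay < 0 ~ "Early",
      dep_delay == 0 ~ "On time",
      dep_delay > 0 ~ "Late"
    ) |> fct_relevel("Early", "On time", "Late")
  )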
View(flights)
?flights
External Excel file
View(flights_labelled)
Photo by Brian McGowan on Unsplash
flights_labelled <- flights |>
  labelled::set_variable_labels(
    year = "Flight year of departure",
    month = "Flight month of departure",
    ...
  )
Further reading
View(flights)
View(flights_labelled)
str(flights_labelled)
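For context on what `str()` reveals here: a variable label is stored as a `label` attribute on the column itself, which is also where `labelled::set_variable_labels()` writes it. The base R sketch below uses a small made-up data frame (`toy_flights` and the `dep_delay` label text are assumptions for illustration) to show the attribute being set and read back without any extra packages.

```r
# Toy data frame for illustration only; values are made up.
toy_flights <- data.frame(
  year = c(2013L, 2013L),
  dep_delay = c(-3, 12)
)

# Attach a variable label as a "label" attribute on each column.
attr(toy_flights$year, "label") <- "Flight year of departure"
attr(toy_flights$dep_delay, "label") <- "Departure delay in minutes"

# Read a single label back.
attr(toy_flights$year, "label")

# Collect every label into a named character vector
# (NA for any column without a label).
labels <- vapply(
  toy_flights,
  function(col) {
    lab <- attr(col, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else lab
  },
  character(1)
)
labels

# str() prints the attribute alongside each column.
str(toy_flights)
```

Because the label travels with the column, the description is available wherever the data frame goes within an R session.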
delay_category represents: Departure timing by origin airport
flights_delay_labelled <- flights_labelled |>
  select(origin, dest, arr_delay, dep_delay) |>
  left_join(
    select(airports_labelled, faa, name),
    join_by(origin == faa)
  ) |>
  mutate(
    delay_category = case_when(
      dep_delay < 0 ~ "Early",
      dep_delay == 0 ~ "On time",
      dep_delay > 0 ~ "Late"
    ) |> fct_relevel("Early", "On time", "Late")
  ) |>
  labelled::set_variable_labels(
    delay_category = "Departure timing by origin airport",
    name = "Origin airport"
  )
Image by Elmer L. Geissler from Pixabay
Assigning variable labels encourages a disciplined practice of creating explicit and succinct variable descriptions, ensuring that data context lives with the data.
This helps:
current you, future you, & colleagues
peer review processes
creation of reusable data assets
Photo by Jared Rice on Unsplash
Data stewardship: the practice of ensuring that data assets are accessible, secure, trustworthy, and usable.
Image by GuangWu YANG from Pixabay
gif from tenor
Characteristic | John F Kennedy Intl, N = 111,279¹ | La Guardia, N = 104,662¹ | Newark Liberty Intl, N = 120,835¹ |
---|---|---|---|
delay_category | | | |
Early | 61,146 (56%) | 63,129 (62%) | 59,300 (50%) |
On time | 6,239 (5.7%) | 4,690 (4.6%) | 5,585 (4.7%) |
Late | 42,031 (38%) | 33,690 (33%) | 52,711 (45%) |
Unknown | 1,863 | 3,153 | 3,239 |
¹ n (%)
Characteristic | John F Kennedy Intl, N = 111,279¹ | La Guardia, N = 104,662¹ | Newark Liberty Intl, N = 120,835¹ |
---|---|---|---|
Departure timing by origin airport | | | |
Early | 61,146 (56%) | 63,129 (62%) | 59,300 (50%) |
On time | 6,239 (5.7%) | 4,690 (4.6%) | 5,585 (4.7%) |
Late | 42,031 (38%) | 33,690 (33%) | 52,711 (45%) |
Unknown | 1,863 | 3,153 | 3,239 |
¹ n (%)
nycflights has 5 data frames & 53 variables
a clinical trial has 90 data frames & 1400 variables
| | source | downstream |
|---|---|---|
| data frames | 90 | 50 |
| variables | 1400 | 700 |
Photo by CHUTTERSNAP on Unsplash
Maintain a csv with metadata.
Apply custom function for bulk label assignment.
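One way this could look, as a minimal base R sketch rather than the function used in the talk: the metadata is a two-column table of variable names and labels, and a small helper writes each label onto the matching column. The names `metadata`, `apply_labels`, `toy_flights`, the column names `variable` and `label`, and the `dep_delay` label text are all assumptions for illustration. In practice the metadata would come from something like `read.csv()` on the maintained file.

```r
# Hypothetical metadata, standing in for the contents of a csv file.
metadata <- data.frame(
  variable = c("year", "month", "dep_delay"),
  label = c(
    "Flight year of departure",
    "Flight month of departure",
    "Departure delay in minutes"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: set the "label" attribute on every column of
# `data` that has a row in `metadata`; other columns are left unchanged.
apply_labels <- function(data, metadata) {
  matched <- intersect(names(data), metadata$variable)
  for (var in matched) {
    # Use the first matching row if a variable is listed more than once.
    attr(data[[var]], "label") <- metadata$label[metadata$variable == var][1]
  }
  data
}

# Toy data for illustration only.
toy_flights <- data.frame(year = 2013L, month = 1L, origin = "JFK")
toy_labelled <- apply_labels(toy_flights, metadata)

attr(toy_labelled$month, "label")
attr(toy_labelled$origin, "label")  # NULL: no metadata row for origin
```

Keeping labels in one file means a change to a description is made once and picked up by every script that applies the metadata.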
R
RStudio
R data
Photo by Jeremy Thomas on Unsplash
R, Python, Julia
RStudio, VS Code, Positron
R data, SAS, XPT, parquet, csv, JSON, DuckDB, PostgreSQL
Do you have sufficient metadata to facilitate reusable data assets?
Can you access and leverage the metadata in your programming environment?
(2024) nycflights13 demo script
https://github.com/shannonpileggi/context-is-king/blob/main/nycflights-delay-demo.R
(2022) {croquet} package
https://github.com/pcctc/croquet
(2022) The case for variable labels in R
https://www.pipinghotdata.com/posts/2022-09-13-the-case-for-variable-labels-in-r/
(2020) Leveraging labelled data in R
https://www.pipinghotdata.com/posts/2020-12-23-leveraging-labelled-data-in-r/
(2019) Advanced R 2e, Ch 3.3 Attributes
https://adv-r.hadley.nz/vectors-chap.html?q=attributes#attributes
(2015) Commit that introduced labels to the RStudio IDE data viewer
https://github.com/rstudio/rstudio/commit/92026abeb9d9ee7a05bdf30a81a5f4d919ea438e
Thank you to the many individuals who helped me develop this talk. Your support was invaluable.
Travis consented to the use of his name.
Slides: shannonpileggi.github.io/context-is-king
Website: pipinghotdata.com