Cleaning Medical Data
with R

Crystal Lewis, Shannon Pileggi, & Peter Higgins

Introduction

Licensing


This work is licensed under a Creative Commons 4.0 International license.

Instructors

Crystal Lewis

Shannon Pileggi

Peter Higgins

Scope

Taming the Data Beast, from Allison Horst’s Data Science Illustrations

Schedule

Time            Topic
11:00 - 11:10   Intro/Logistics
11:10 - 12:00   Crystal Lewis (Principles of Data Management)
12:00 - 12:05   Break
12:05 - 12:55   Shannon Pileggi (Stage 1 Data Cleaning)
12:55 - 01:00   Break
01:00 - 02:00   Peter Higgins (Stage 2 Data Cleaning)


Please add any questions to the public Zoom chat. These may be answered in the moment or addressed at the end depending on context.

Exercises

All exercises can be accessed via Posit Cloud.

See exercise instructions for the link to access the Posit Cloud workspace.

Syntax aside

Pipes

  • 2014+ magrittr pipe %>%

  • 2021+ (R ≥ 4.1.0) native R pipe |>

Isabella Velásquez (2022), Understanding the native R pipe |>: https://ivelasq.rbind.io/blog/understanding-the-r-pipe/

# these two calls are equivalent
whatever(arg1, arg2, arg3, ...)

arg1 |>  
  whatever(arg2, arg3)

# as are these
mean(0:10)

0:10 |> 
  mean()

R for Data Science: Ch 18 Pipes

Namespacing

package::function()

dplyr::select()

  • tells R explicitly to use the function select from the package dplyr

  • can help avoid name conflicts (e.g., between dplyr::select() and MASS::select())

  • does not require library(dplyr)

  • generally, we aimed to namespace functions from non-tidyverse packages.

library(dplyr)

select(mtcars, mpg, cyl) 

mtcars |>  
  select(mpg, cyl) 
# library(dplyr) not needed

dplyr::select(mtcars, mpg, cyl) 

mtcars |>  
  dplyr::select(mpg, cyl) 

Principles of Data Management

Data Horror Stories

Data Organizing Principles


  • Data Structure
  • Variable Values
  • Variable Types
  • Missing Data

Data Structure

  1. Data should make a rectangle of rows and columns
    • You should have the expected number of rows (cases in your data)
    • You should have the expected number of columns (variables in your data)

Data Structure

  2. Variable names should be the first, and only the first, row of your data
    • They should also adhere to best practices (a renaming sketch follows below)
    • Variable names should
      • Be unique
      • Be meaningful (gender instead of X1)
      • Not include spaces
      • Not include special characters except _
        • So no /, -, !, "
      • Not start with a number or special character
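
A minimal renaming sketch, using hypothetical non-compliant column names:

library(dplyr)

# Hypothetical raw data with names that break the rules above
df <- tibble::tibble(
  `Patient ID!` = c("001", "002"),
  `1st visit`   = c("2021-01-12", "2021-01-19")
)

# Rename to unique, meaningful names without spaces or special characters
df |>
  rename(
    pat_id      = `Patient ID!`,
    first_visit = `1st visit`
  )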

Exercise

What data structure issues do you notice in our sample data?

01:00

Data Structure

  • Variable names are not the first row of the data
  • Our data does not make a rectangle (empty column, empty rows)
  • Variable names do not adhere to best practices

Variable Values

  1. Values should be explicit, not implicit
    • If a blank cell is implied to be zero, fill that cell with an actual zero
    • No color coding should be used to indicate information. Make a new variable.
  2. Values should be analyzable. This means no more than one measure should be captured in a variable.
  3. Variables should be captured consistently within a column (a quick check is sketched after this list)
    • Dates captured consistently (e.g., YYYY-MM-DD)
    • Categories captured consistently (both spelling and capitalization)
    • If the variable is numeric, values should fall within your expected range
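
A minimal sketch of checking those expectations, assuming a hypothetical data frame df with a 0-3 score start_mes and a category column ethnic (names borrowed from the sample data):

library(dplyr)

# df: hypothetical data frame with start_mes (0-3) and ethnic columns
# Any values outside the expected 0-3 range?
df |>
  filter(!between(start_mes, 0, 3))

# Inconsistent spelling/capitalization shows up as extra rows
df |>
  count(ethnic)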

Exercise

What variable value issues do you notice in our sample data?

01:00

Variable Values

  • Color coding used to indicate information
  • Two things measured in one column
  • Categorical values captured inconsistently

Variable Types

Variables should be stored as your expected type (or, in R terms, class)

  1. Numeric
    • Contain numeric values (14.5, 14.539, 789, -24)
    • Numeric variables cannot contain special characters, spaces, or letters
      • 100mg
      • 83/150
      • " 89"
  2. Date, Time, Date-Time
    • Represented in R as either <date>, <time> or <dttm>/<POSIXct>
    • Allow you to perform calculations on dates

Variable Types

  3. Character
    • Contain character values or strings ("kg", "R in Medicine", "11.5", "5mg")
  4. Factor
    • A special class of variables, helpful when working with categorical or ordinal variables
    • Factors assign an order to your variable groups
    • You must assign this class to your variables (a sketch follows below)
    • You can learn more about working with factors in this article: Wrangling Categorical Data in R
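
A minimal factor sketch, assigning an explicit level order to some made-up severity values:

severity <- c("medium", "medium", "low", "high", "low")

# factor() assigns both the class and the level order
severity_fct <- factor(severity, levels = c("low", "medium", "high"))

severity_fct
[1] medium medium low    high   low   
Levels: low medium high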

Exercise

What is the R class for the following variables?

[1] " 7.5" "2"    "3.6" 


class(var1)
[1] "character"
[1] medium medium low    high   low   
Levels: low medium high


class(var2)
[1] "factor"
[1] "50kg" "59kg" "82kg"


class(var3)
[1] "character"

Exercise

What variable type issues do you notice in our sample data?

01:00

Variable Types

  • Dates stored as numbers
  • Text stored in numeric variables

Missing Data

  1. Missing data should appear as you expect it to (a quick check is sketched after this list)
    • The amount of missingness
    • The variables/cases that data is missing for
  2. Use consistent values to indicate missing responses (blank, NA, -999)
    • Document your decision
    • The missing values should match your variable type
      • e.g., don't use "no response" in a numeric variable
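
A minimal sketch of checking the amount of missingness per column, assuming a hypothetical data frame df:

# df: hypothetical data frame
# Count missing values in every column
colSums(is.na(df))

# Or, tidyverse-style
library(dplyr)
df |>
  summarize(across(everything(), ~ sum(is.na(.x))))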

Exercise

What missing data issues do you notice in our sample data?

01:00

Missing Data

  • Unexpected missing data
  • Inconsistent missing values used
  • Missing values do not match variable type

Error Reduction

The number one way to reduce data errors is to make a plan before you collect data

Correct data at the source


Plan the variables you want to collect

Build your data collection/entry tools in a way that follows your plan

Test your data tools before collecting/entering data

Check your data often during data collection/entry

Plan the variables you want to collect


Necessary to plan for

  • Variable name
  • Variable label/item wording
  • Variable type
  • Allowable values/ranges
  • Missing values

Nice to plan for

  • Skip patterns
  • Required items
  • Variable universe

Add those variables to a data dictionary


var_name    label                                         type       values                  missing_values
pat_id      Patient Identifier                            character  001-030                 NA
treatment   Treatment for UC                              character  upa; uste; oza          NA
start_date  Date of start of treatment                    date       YYYY-MM-DD              NA
ethnic      Ethnicity - hispanic or not hispanic          character  hispanic; not hispanic  missing
start_mes   Mayo endoscopic Score at start of treatment   numeric    0-3                     -99

Build your tools based on your data dictionary


Name your variables correctly in your tool

  • Instead of Q1, Q2, Q3 -> id, start_date, treatment

Build items to only accept allowable values

  • Only within specified range (0-50)
  • Only within specified categories (“hispanic”, “not hispanic”)

Build items to only accept specified variable types

  • Only numeric values
  • Only dates in the YYYY-MM-DD format

Test your data collection or entry tool

  • Collect/enter sample data
    • Are any items missing?
    • Are you getting unexpected values for items?
      • Values out of range
      • Incorrect formats
      • Inconsistent entries (a frequency-table sketch follows this list)
        • "m", "male", "Male", "MALE"
    • Is the skip logic working as expected?
    • Are people able to skip items that they should not be able to skip?
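
A quick frequency table surfaces inconsistent test entries; a sketch with janitor::tabyl(), assuming a hypothetical df_test with a sex item:

library(janitor)

# df_test: hypothetical sample-entry data with a sex item
# "m", "male", "Male", and "MALE" each appear as separate rows
df_test |>
  tabyl(sex)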

Review your data often during data collection or entry

  1. Validate your data based on your expectations

pointblank report

library(pointblank)

# Import my data

df_raw <- readxl::read_excel("data/mydata.xlsx")

# Check my assumptions

create_agent(df_raw) |>
  rows_distinct(columns = vars(pat_id)) |>
  col_vals_not_null(columns = vars(pat_id)) |>
  col_is_date(columns = vars(start_date)) |>
  col_is_numeric(columns = vars(start_mes)) |>
  col_vals_in_set(columns = vars(treatment), set = c("upa", "uste", "oza")) |>
  col_vals_in_set(columns = vars(ethnic), set = c("hispanic", "not hispanic")) |>
  col_vals_between(columns = vars(start_mes), left = 0, right = 3, na_pass = FALSE) |>
  interrogate()

Review your data often during data collection

  1. Create a codebook to review univariate summary statistics
    • codebookr
    • codebook
    • memisc
    • sjPlot

codebookr codebook

library(codebookr)

# Import my data

df_raw <- readxl::read_excel("data/mydata.xlsx")

# Create my codebook

df_codebook <- codebook(df_raw)

print(df_codebook, "my_codebookr_codebook.docx")

Data Cleaning Practices

Scenario

  • We have data that originate from an observational study comparing 3 treatments of ulcerative colitis (UC)

  • We have an analysis question:

    • Are there differences in change in MES and QOL scores between start and finish, and are the decreases in scores greater for any of the 3 new medications?
  • In order to answer this question, we have asked a student to extract data from the medical record into Excel

  • Along with the spreadsheet, we are provided a data dictionary

  • As we start to review the data, we find an assortment of errors that need correction

Exercise

Take 5 minutes to review the data dictionary and our data.

  1. Log in to Posit Cloud and navigate to our project
  2. Open the data folder and open the file “messy_uc.xlsx”

 

  • When you finish, give us a 👍

  • If you are having trouble, give us a ✋

05:00

Import our file

  • We are going to use the read_excel() function from the readxl package

  • There are several arguments to consider when using this function

    • path
    • sheet = NULL
    • col_names = TRUE
    • na = ""
    • skip = 0
  • Type ?read_excel in your console to see more arguments
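
For instance, the na and skip arguments handle spreadsheets where missing cells are coded as text and the data start below a title row (a sketch with a hypothetical file):

library(readxl)

# hypothetical file: data/example.xlsx
# Treat "" and "-99" as missing; skip one title row above the header
read_excel(
  "data/example.xlsx",
  sheet = 1,
  na    = c("", "-99"),
  skip  = 1
)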

Import our file

library(readxl)

# Import our file

df_raw <- readxl::read_excel("data/messy_uc.xlsx",
  sheet = "__", 
  skip = __
)
pat_id treatment start_date ethnic race dob ...7 start_bp pre/post_wt_kg start_mes
001 upa 44208 hispanic Caucasian 2005-01-07 NA 114/72 84/82 3
002 uste 44215 not hispanic Caucasian 1937-04-13 NA 132/86 77/77 2
003 oza 44230 not hispanic African-American 1946-06-06 NA 124/92 74/75 1
004 upa 44245 not hispanic Caucasian 1963-07-14 NA 144/83 66/65 3
005 oza 44255 not hispanic Mixed 1978-05-12 NA 122/78 55/56 3
006 uste 44259 not hispanic Other 1992-04-03 NA 121/80 111/110 2
007 uste 44264 not hispanic Asian 1955-08-22 NA 133/74 133/130 3
008 oza 44999 Hispanic afromerican 1974-09-11 NA 116/73 74/76 3
009 upa 44276 NOT hispanic Caucasian 1984-11-14 NA 118/66 82/80 2
010 oza 44278 hispamnic Caucasian 1972-12-20 NA 122/78 85/87 3

Exercise (CL1)


Your turn! Take 3 minutes to import the data.

  1. Open “exercises.qmd” in our Posit Cloud project
  2. Navigate to ## CL1
  3. Update the code and run the code chunk using the green arrow


–> Take me to the exercises <–

03:00

Review the data

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. - R for Data Science

Review the data


Get to know your data

  • How many rows? How many columns?
  • What are the variable types?
  • What are variable values?
  • How much missing data is there?
  • How are variables related?

There are several functions that can be used to explore data

  • dplyr::glimpse()
  • skimr::skim()
  • base::summary()
  • visdat::vis_dat()
  • summarytools::dfSummary()
  • DataExplorer::create_report()
  • Hmisc::describe()

summarytools::dfSummary()

library(summarytools)

# Review our data

dfSummary(df_raw)

skimr::skim()

library(skimr)

# Review our data

skim(df_raw)

Exercise (CL2)

Use one or more of these exploratory packages to review your data. What fixes do you see that need to happen?

  • dplyr::glimpse()
  • skimr::skim()
  • base::summary()
  • visdat::vis_dat()
  • summarytools::dfSummary()
  • DataExplorer::create_report()
  • Hmisc::describe()


–> Take me to the exercises <–

05:00

Stage 1
Data Cleaning

Variable names

Original variable names in Excel:


Variable names import as shown, with modifications from readxl::read_excel() to ensure uniqueness:

Variable names, cleaner

Variable names as imported:


janitor::clean_names() removes special characters and implements snake case by default:

df_clean <- df_raw |> 
  janitor::clean_names() 

Remove empty columns or rows

df_clean |> 
  select(pat_id, race:start_bp) |> 
  slice(13:18)
# A tibble: 6 × 5
  pat_id race             dob                 x7    start_bp
  <chr>  <chr>            <dttm>              <lgl> <chr>   
1 013    Caucasian        1948-02-27 00:00:00 NA    118/73  
2 014    African-American 1966-04-22 00:00:00 NA    106/59  
3 015    H/API            1978-08-11 00:00:00 NA    112/69  
4 <NA>   <NA>             NA                  NA    <NA>    
5 016    African-American 1998-10-28 00:00:00 NA    114/76  
6 017    Caucasian        2001-01-09 00:00:00 NA    124/80  
df_clean <- df_raw |> 
  janitor::clean_names() |> 
  janitor::remove_empty(which = c("rows", "cols"))


df_clean |> 
  select(pat_id, race:start_bp) |> 
  slice(13:18)
# A tibble: 6 × 4
  pat_id race             dob                 start_bp
  <chr>  <chr>            <dttm>              <chr>   
1 013    Caucasian        1948-02-27 00:00:00 118/73  
2 014    African-American 1966-04-22 00:00:00 106/59  
3 015    H/API            1978-08-11 00:00:00 112/69  
4 016    African-American 1998-10-28 00:00:00 114/76  
5 017    Caucasian        2001-01-09 00:00:00 124/80  
6 018    Caucasian        1994-03-07 00:00:00 120/68  
df_raw |>
  janitor::clean_names() |> 
  glimpse()
Rows: 31
Columns: 38
$ pat_id                           <chr> "001", "002", "003", "004", "005", "0…
$ treatment                        <chr> "upa", "uste", "oza", "upa", "oza", "…
$ start_date                       <dbl> 44208, 44215, 44230, 44245, 44255, 44…
$ ethnic                           <chr> "hispanic", "not hispanic", "not hisp…
$ race                             <chr> "Caucasian", "Caucasian", "African-Am…
$ dob                              <dttm> 2005-01-07, 1937-04-13, 1946-06-06, …
$ x7                               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ start_bp                         <chr> "114/72", "132/86", "124/92", "144/83…
$ pre_post_wt_kg                   <chr> "84/82", "77/77", "74/75", "66/65", "…
$ start_mes                        <dbl> 3, 2, 1, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3…
$ start_bss                        <dbl> 75.85449, 54.65536, 28.46290, 77.2364…
$ start_abd_score                  <dbl> 76.80263, 53.56908, 30.82623, 77.7276…
$ start_sys                        <dbl> 81.63807, 53.39802, 29.50837, 79.0785…
$ start_coping                     <dbl> 50.75913, 28.23465, 20.03401, 45.1019…
$ start_emo                        <dbl> 79.31260, 58.55421, 32.22812, 75.6089…
$ daily_life_impact_score_at_start <dbl> 87.81723, 59.87844, 28.39321, 86.1173…
$ start_wbc                        <dbl> 8.2, 10.1, 5.5, 4.7, 8.9, 9.3, 5.6, 9…
$ start_plt                        <chr> "273K/microL", "414K/microL", "323K/m…
$ start_na                         <chr> "137mmol/L", "142mmol/L", "140mmol/L"…
$ start_k                          <chr> "3.7", "4.0999999999999996", "4.3", "…
$ end_month                        <dbl> 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9…
$ end_day                          <dbl> 14, 21, 6, 22, 30, 4, 10, 15, 22, 26,…
$ end_year                         <dbl> 2021, 2021, 2021, 2021, 2021, 2021, 2…
$ end_mes                          <dbl> 0, 1, 1, 1, 2, 1, 2, 2, 0, 1, 0, 1, 1…
$ end_bss                          <dbl> 9.854493, 32.655357, 28.462897, 33.23…
$ end_abd                          <dbl> 16.802628, 33.569079, 30.826225, 37.7…
$ end_sys                          <dbl> 9.638072, 29.398023, 29.508373, 31.07…
$ end_coping                       <dbl> 23.759131, 19.234646, 20.034009, 27.1…
$ end_emo                          <chr> "39.760954572288782", "49.36928154791…
$ end_dl                           <chr> "9.4097102977463347", "35.23874059870…
$ end_wbc                          <dbl> 8.541698, 11.076921, 3.000000, 4.8581…
$ end_plt                          <dbl> 201, 340, 256, 327, 432, 348, 181, 12…
$ end_na                           <dbl> 137.3278, 142.2140, 140.0831, 139.158…
$ end_k                            <dbl> 3.958430, 4.174979, 4.433533, 3.53357…
$ fake_street                      <chr> "990 Mohammad Mountain", "8512 O'Conn…
$ fake_city                        <chr> "North Sigmundville", "Port Halstad",…
$ fake_state                       <chr> "New Mexico", "Missouri", "South Caro…
$ fake_zip                         <dbl> 96074, 11264, 57246, 31457, 30711, 52…
df_raw |> 
  janitor::clean_names() |> 
  janitor::remove_empty(which = c("rows", "cols")) |> 
  glimpse()
Rows: 30
Columns: 37
$ pat_id                           <chr> "001", "002", "003", "004", "005", "0…
$ treatment                        <chr> "upa", "uste", "oza", "upa", "oza", "…
$ start_date                       <dbl> 44208, 44215, 44230, 44245, 44255, 44…
$ ethnic                           <chr> "hispanic", "not hispanic", "not hisp…
$ race                             <chr> "Caucasian", "Caucasian", "African-Am…
$ dob                              <dttm> 2005-01-07, 1937-04-13, 1946-06-06, …
$ start_bp                         <chr> "114/72", "132/86", "124/92", "144/83…
$ pre_post_wt_kg                   <chr> "84/82", "77/77", "74/75", "66/65", "…
$ start_mes                        <dbl> 3, 2, 1, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3…
$ start_bss                        <dbl> 75.85449, 54.65536, 28.46290, 77.2364…
$ start_abd_score                  <dbl> 76.80263, 53.56908, 30.82623, 77.7276…
$ start_sys                        <dbl> 81.63807, 53.39802, 29.50837, 79.0785…
$ start_coping                     <dbl> 50.75913, 28.23465, 20.03401, 45.1019…
$ start_emo                        <dbl> 79.31260, 58.55421, 32.22812, 75.6089…
$ daily_life_impact_score_at_start <dbl> 87.81723, 59.87844, 28.39321, 86.1173…
$ start_wbc                        <dbl> 8.2, 10.1, 5.5, 4.7, 8.9, 9.3, 5.6, 9…
$ start_plt                        <chr> "273K/microL", "414K/microL", "323K/m…
$ start_na                         <chr> "137mmol/L", "142mmol/L", "140mmol/L"…
$ start_k                          <chr> "3.7", "4.0999999999999996", "4.3", "…
$ end_month                        <dbl> 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9…
$ end_day                          <dbl> 14, 21, 6, 22, 30, 4, 10, 15, 22, 26,…
$ end_year                         <dbl> 2021, 2021, 2021, 2021, 2021, 2021, 2…
$ end_mes                          <dbl> 0, 1, 1, 1, 2, 1, 2, 2, 0, 1, 0, 1, 1…
$ end_bss                          <dbl> 9.854493, 32.655357, 28.462897, 33.23…
$ end_abd                          <dbl> 16.802628, 33.569079, 30.826225, 37.7…
$ end_sys                          <dbl> 9.638072, 29.398023, 29.508373, 31.07…
$ end_coping                       <dbl> 23.759131, 19.234646, 20.034009, 27.1…
$ end_emo                          <chr> "39.760954572288782", "49.36928154791…
$ end_dl                           <chr> "9.4097102977463347", "35.23874059870…
$ end_wbc                          <dbl> 8.541698, 11.076921, 3.000000, 4.8581…
$ end_plt                          <dbl> 201, 340, 256, 327, 432, 348, 181, 12…
$ end_na                           <dbl> 137.3278, 142.2140, 140.0831, 139.158…
$ end_k                            <dbl> 3.958430, 4.174979, 4.433533, 3.53357…
$ fake_street                      <chr> "990 Mohammad Mountain", "8512 O'Conn…
$ fake_city                        <chr> "North Sigmundville", "Port Halstad",…
$ fake_state                       <chr> "New Mexico", "Missouri", "South Caro…
$ fake_zip                         <dbl> 96074, 11264, 57246, 31457, 30711, 52…

Recoding

df_clean |> 
  count(ethnic)
# A tibble: 5 × 2
  ethnic           n
  <chr>        <int>
1 Hispanic         1
2 NOT hispanic     1
3 hispamnic        1
4 hispanic         3
5 not hispanic    24
df_clean <- df_raw |>
  janitor::clean_names() |> 
  janitor::remove_empty(which = c("rows", "cols")) |> 
  mutate(
    ethnic_clean = case_when(
      ethnic %in%  c("hispanic", "Hispanic", "hispamnic") ~ "hispanic",
      ethnic %in%  c("NOT hispanic", "not hispanic") ~ "not hispanic",
      .default = ethnic
    )
  )

df_clean |> 
  count(ethnic_clean)
# A tibble: 2 × 2
  ethnic_clean     n
  <chr>        <int>
1 hispanic         5
2 not hispanic    25
df_clean |> 
  count(ethnic_clean, ethnic)
# A tibble: 5 × 3
  ethnic_clean ethnic           n
  <chr>        <chr>        <int>
1 hispanic     Hispanic         1
2 hispanic     hispamnic        1
3 hispanic     hispanic         3
4 not hispanic NOT hispanic     1
5 not hispanic not hispanic    24

Exercise


Complete Data Cleaning Fundamentals Exercise SP1.


–> Take me to the exercises <–

05:00

Replace values with missing

df_clean |> 
  count(end_na) 
# A tibble: 30 × 2
   end_na     n
    <dbl> <int>
 1   -99      1
 2   133.     1
 3   135.     1
 4   135.     1
 5   136.     1
 6   137.     1
 7   137.     1
 8   138.     1
 9   138.     1
10   138.     1
# ℹ 20 more rows
df_clean |> 
  ggplot(aes(x = end_na)) +
  geom_histogram()

df_clean <- df_raw |>
  janitor::clean_names() |> 
  janitor::remove_empty(which = c("rows", "cols")) |> 
  mutate(
    ethnic_clean = case_when(
      ethnic %in%  c("hispanic", "Hispanic", "hispamnic") ~ "hispanic",
      ethnic %in%  c("NOT hispanic", "not hispanic") ~ "not hispanic",
    ),
    end_na_clean = na_if(end_na, -99)
  ) 
df_clean |> 
  count(end_na, end_na_clean) 
# A tibble: 30 × 3
   end_na end_na_clean     n
    <dbl>        <dbl> <int>
 1   -99           NA      1
 2   133.         133.     1
 3   135.         135.     1
 4   135.         135.     1
 5   136.         136.     1
 6   137.         137.     1
 7   137.         137.     1
 8   138.         138.     1
 9   138.         138.     1
10   138.         138.     1
# ℹ 20 more rows
df_clean |> 
  ggplot(aes(x = end_na_clean)) +
  geom_histogram()

Incorrect variable type

df_raw |> 
  select(end_emo) |> 
  glimpse()
Rows: 31
Columns: 1
$ end_emo <chr> "39.760954572288782", "49.369281547911214", "35.90128630678054…


mean(df_raw[["end_emo"]], na.rm = TRUE)
Warning in mean.default(df_raw[["end_emo"]], na.rm = TRUE): argument is not
numeric or logical: returning NA
[1] NA
df_raw[["end_emo"]]
 [1] "39.760954572288782" "49.369281547911214" "35.901286306780541"
 [4] "50.2057532077703"   "64.973437247083666" "35.965982580017993"
 [7] "57.76367623799068"  "61.960515932643503" "27.768279853236116"
[10] "45.948161004250508" "37.34426764708568"  "51.859745680890889"
[13] "55.055731509083827" "62.246053520477801" "not done"          
[16] NA                   "51.794563346733348" "28.024233546743879"
[19] "58.368126398937022" "13.120426912160625" "45.417571554273906"
[22] "41.388894056374887" "27.007297504717883" "64.595741275696824"
[25] "50.243011921919624" "52.659962900705473" "41.581059324621933"
[28] "33.176949642484104" "67.258563644329342" "22.931081374117134"
[31] "31.118353038978491"
df_clean <- df_raw |>
  janitor::clean_names() |> 
  janitor::remove_empty(which = c("rows", "cols")) |> 
  mutate(
    ethnic_clean = case_when(
      ethnic %in%  c("hispanic", "Hispanic", "hispamnic") ~ "hispanic",
      ethnic %in%  c("NOT hispanic", "not hispanic") ~ "not hispanic",
    ),
    end_na_clean = na_if(end_na, -99),
    end_emo_clean = na_if(end_emo, "not done") |> as.numeric()
  ) 
df_clean |> 
  select(end_emo, end_emo_clean) |> 
  glimpse()
Rows: 30
Columns: 2
$ end_emo       <chr> "39.760954572288782", "49.369281547911214", "35.90128630…
$ end_emo_clean <dbl> 39.76095, 49.36928, 35.90129, 50.20575, 64.97344, 35.965…


mean(df_clean[["end_emo_clean"]], na.rm = TRUE)
[1] 44.99341
df_clean |> 
  count(end_emo_clean, end_emo)
# A tibble: 30 × 3
   end_emo_clean end_emo                n
           <dbl> <chr>              <int>
 1          13.1 13.120426912160625     1
 2          22.9 22.931081374117134     1
 3          27.0 27.007297504717883     1
 4          27.8 27.768279853236116     1
 5          28.0 28.024233546743879     1
 6          31.1 31.118353038978491     1
 7          33.2 33.176949642484104     1
 8          35.9 35.901286306780541     1
 9          36.0 35.965982580017993     1
10          37.3 37.34426764708568      1
# ℹ 20 more rows

Correcting dates

df_raw |> 
  select(start_date) |> 
  glimpse()
Rows: 31
Columns: 1
$ start_date <dbl> 44208, 44215, 44230, 44245, 44255, 44259, 44264, 44999, 442…


df_raw[["start_date"]]
 [1] 44208 44215 44230 44245 44255 44259 44264 44999 44276 44278 44297 44308
[13] 44313 44318 44324    NA 44329 44332 44346 44358 44370 44383 44391 44397
[25] 44412 44425 44434 44444 44461 44475 44500
df_clean <- df_raw |>
  janitor::clean_names() |> 
  janitor::remove_empty(which = c("rows", "cols")) |> 
  mutate(
    ethnic_clean = case_when(
      ethnic %in%  c("hispanic", "Hispanic", "hispamnic") ~ "hispanic",
      ethnic %in%  c("NOT hispanic", "not hispanic") ~ "not hispanic",
    ),
    end_na_clean = na_if(end_na, -99),
    end_emo_clean = na_if(end_emo, "not done") |> as.numeric(),
    start_date_clean = janitor::convert_to_date(start_date)
  ) 
df_clean |> 
  select(start_date, start_date_clean) |> 
  glimpse()
Rows: 30
Columns: 2
$ start_date       <dbl> 44208, 44215, 44230, 44245, 44255, 44259, 44264, 4499…
$ start_date_clean <date> 2021-01-12, 2021-01-19, 2021-02-03, 2021-02-18, 2021…
df_clean |> 
  count(start_date, start_date_clean) 
# A tibble: 30 × 3
   start_date start_date_clean     n
        <dbl> <date>           <int>
 1      44208 2021-01-12           1
 2      44215 2021-01-19           1
 3      44230 2021-02-03           1
 4      44245 2021-02-18           1
 5      44255 2021-02-28           1
 6      44259 2021-03-04           1
 7      44264 2021-03-09           1
 8      44276 2021-03-21           1
 9      44278 2021-03-23           1
10      44297 2021-04-11           1
# ℹ 20 more rows

Extracting numbers from text

df_raw |> 
  select(start_na) |> 
  glimpse()
Rows: 31
Columns: 1
$ start_na <chr> "137mmol/L", "142mmol/L", "140mmol/L", "139mmol/L", "144mmol/…


mean(df_raw[["start_na"]], na.rm = TRUE)
Warning in mean.default(df_raw[["start_na"]], na.rm = TRUE): argument is not
numeric or logical: returning NA
[1] NA
df_raw[["start_na"]]
 [1] "137mmol/L" "142mmol/L" "140mmol/L" "139mmol/L" "144mmol/L" "145mmol/L"
 [7] "142mmol/L" "138mmol/L" "140mmol/L" "137mmol/L" "143mmol/L" "136mmol/L"
[13] "135mmol/L" "141mmol/L" "133mmol/L" NA          "135mmol/L" "143mmol/L"
[19] "136mmol/L" "144mmol/L" "145mmol/L" "140mmol/L" "141mmol/L" "142mmol/L"
[25] "138mmol/L" "139mmol/L" "142mmol/L" "144mmol/L" "139mmol/L" "138mmol/L"
[31] "140mmol/L"
df_clean <- df_raw |>
  janitor::clean_names() |> 
  janitor::remove_empty(which = c("rows", "cols")) |> 
  mutate(
    ethnic_clean = case_when(
      ethnic %in%  c("hispanic", "Hispanic", "hispamnic") ~ "hispanic",
      ethnic %in%  c("NOT hispanic", "not hispanic") ~ "not hispanic",
    ),
    end_na_clean = na_if(end_na, -99),
    end_emo_clean = na_if(end_emo, "not done") |> as.numeric(),
    start_na_clean = parse_number(start_na)
  ) 
df_clean |> 
  select(start_na, start_na_clean) |> 
  glimpse()
Rows: 30
Columns: 2
$ start_na       <chr> "137mmol/L", "142mmol/L", "140mmol/L", "139mmol/L", "14…
$ start_na_clean <dbl> 137, 142, 140, 139, 144, 145, 142, 138, 140, 137, 143, …


mean(df_clean[["start_na_clean"]], na.rm = TRUE)
[1] 139.9333
df_clean |> 
  count(start_na_clean, start_na)
# A tibble: 12 × 3
   start_na_clean start_na      n
            <dbl> <chr>     <int>
 1            133 133mmol/L     1
 2            135 135mmol/L     2
 3            136 136mmol/L     2
 4            137 137mmol/L     2
 5            138 138mmol/L     3
 6            139 139mmol/L     3
 7            140 140mmol/L     4
 8            141 141mmol/L     2
 9            142 142mmol/L     4
10            143 143mmol/L     2
11            144 144mmol/L     3
12            145 145mmol/L     2

Exercise


Complete Data Cleaning Fundamentals Exercise SP2.


–> Take me to the exercises <–

05:00

Character variable should be a factor

df_clean |> 
  count(treatment)
# A tibble: 3 × 2
  treatment     n
  <chr>     <int>
1 oza          10
2 upa          10
3 uste         10


df_clean |> 
  count(ethnic_clean)
# A tibble: 2 × 2
  ethnic_clean     n
  <chr>        <int>
1 hispanic         5
2 not hispanic    25
df_clean |> 
  select(treatment, ethnic_clean) |> 
  glimpse()
Rows: 30
Columns: 2
$ treatment    <chr> "upa", "uste", "oza", "upa", "oza", "uste", "uste", "oza"…
$ ethnic_clean <chr> "hispanic", "not hispanic", "not hispanic", "not hispanic…
df_clean |> 
  select(treatment, ethnic_clean) |> 
  gtsummary::tbl_summary(by = treatment)
Characteristic    oza, N = 10¹  upa, N = 10¹  uste, N = 10¹
ethnic_clean
    hispanic      4 (40%)       1 (10%)       0 (0%)
    not hispanic  6 (60%)       9 (90%)       10 (100%)
¹ n (%)
df_clean <- df_raw |>
  janitor::clean_names() |> 
  janitor::remove_empty(which = c("rows", "cols")) |> 
  mutate(
    ethnic_clean = case_when(
      ethnic %in%  c("hispanic", "Hispanic", "hispamnic") ~ "hispanic",
      ethnic %in%  c("NOT hispanic", "not hispanic") ~ "not hispanic",
    ) |> fct_infreq(),
    end_na_clean = na_if(end_na, -99),
    end_emo_clean = na_if(end_emo, "not done") |> as.numeric(),
    start_na_clean = parse_number(start_na),
    treatment = fct_relevel(treatment, "upa", "uste", "oza")
  ) 


See the forcats package for other factor handling solutions.

df_clean |> 
  count(treatment)
# A tibble: 3 × 2
  treatment     n
  <fct>     <int>
1 upa          10
2 uste         10
3 oza          10


df_clean |> 
  count(ethnic_clean)
# A tibble: 2 × 2
  ethnic_clean     n
  <fct>        <int>
1 not hispanic    25
2 hispanic         5
df_clean |> 
  select(treatment, ethnic_clean) |> 
  glimpse()
Rows: 30
Columns: 2
$ treatment    <fct> upa, uste, oza, upa, oza, uste, uste, oza, upa, oza, upa,…
$ ethnic_clean <fct> hispanic, not hispanic, not hispanic, not hispanic, not h…
df_clean |> 
  select(treatment, ethnic_clean) |> 
  gtsummary::tbl_summary(by = treatment)
Characteristic    upa, N = 10¹  uste, N = 10¹  oza, N = 10¹
ethnic_clean
    not hispanic  9 (90%)       10 (100%)     6 (60%)
    hispanic      1 (10%)       0 (0%)        4 (40%)
¹ n (%)

Exercise


Complete Data Cleaning Fundamentals Exercise SP3.


–> Take me to the exercises <–

05:00

Separating values

df_clean |> 
  select(start_bp) |> 
  glimpse()
Rows: 30
Columns: 1
$ start_bp <chr> "114/72", "132/86", "124/92", "144/83", "122/78", "121/80", "…


mean(df_clean[["start_bp"]], na.rm = TRUE)
[1] NA


df_clean[["start_bp"]]
 [1] "114/72" "132/86" "124/92" "144/83" "122/78" "121/80" "133/74" "116/73"
 [9] "118/66" "122/78" "126/82" "114/68" "118/73" "106/59" "112/69" "114/76"
[17] "124/80" "120/68" "119/77" "116/74" "121/80" "112/58" "117/67" "118/73"
[25] "116/74" "126/84" "144/96" "120/84" "115/75" "142/92"
df_clean <- df_raw |>
  janitor::clean_names() |> 
  janitor::remove_empty(which = c("rows", "cols")) |> 
  mutate(
    ethnic_clean = case_when(
      ethnic %in%  c("hispanic", "Hispanic", "hispamnic") ~ "hispanic",
      ethnic %in%  c("NOT hispanic", "not hispanic") ~ "not hispanic",
    ) |> fct_infreq(),
    end_na_clean = na_if(end_na, -99),
    end_emo_clean = na_if(end_emo, "not done") |> as.numeric(),
    start_na_clean = parse_number(start_na),
    treatment = fct_relevel(treatment, "upa", "uste", "oza")
    ) |>  
  separate_wider_delim(start_bp, delim ="/", names = c("bp_systolic", "bp_diastolic"), cols_remove = FALSE) |> 
  mutate(across(c(bp_systolic, bp_diastolic), as.numeric)) 
df_clean |> 
  select(start_bp, bp_systolic, bp_diastolic) |> 
  glimpse()
Rows: 30
Columns: 3
$ start_bp     <chr> "114/72", "132/86", "124/92", "144/83", "122/78", "121/80…
$ bp_systolic  <dbl> 114, 132, 124, 144, 122, 121, 133, 116, 118, 122, 126, 11…
$ bp_diastolic <dbl> 72, 86, 92, 83, 78, 80, 74, 73, 66, 78, 82, 68, 73, 59, 6…


mean(df_clean[["bp_systolic"]], na.rm = TRUE)
[1] 121.5333
mean(df_clean[["bp_diastolic"]], na.rm = TRUE)
[1] 76.36667
df_clean[["bp_systolic"]]
 [1] 114 132 124 144 122 121 133 116 118 122 126 114 118 106 112 114 124 120 119
[20] 116 121 112 117 118 116 126 144 120 115 142
df_clean[["bp_diastolic"]]
 [1] 72 86 92 83 78 80 74 73 66 78 82 68 73 59 69 76 80 68 77 74 80 58 67 73 74
[26] 84 96 84 75 92

Assigning labels

What does anything mean?

df_clean |> 
  select(pat_id, start_na, start_na_clean, pre_post_wt_kg, start_emo) |> 
  view()

# first import the data dictionary
# here::here() builds the path from the project root
df_dictionary <- readxl::read_excel(
  path = here::here("data", "messy_uc.xlsx"),
  sheet = "Data_Dictionary"
)
df_dictionary
# A tibble: 33 × 5
   Variable       Details                                      Units Range ...5 
   <chr>          <chr>                                        <chr> <chr> <chr>
 1 pat_id         Patient Identifier                           digi… 001-… <NA> 
 2 treatment      Treatment for UC                             <NA>  upad… <NA> 
 3 start_date     Date of start of treatment                   digi… YYYY… <NA> 
 4 ethnic         Ethnicity - hispanic or not hispanic         <NA>  hisp… <NA> 
 5 race           Race - one of 7 choices                      <NA>  cauc… <NA> 
 6 dob            date of birth                                digi… YYYY… <NA> 
 7 start_bp       blood pressure at start - systolic/diastolic mm Hg syst… <NA> 
 8 pre/post_wt_kg weight in kilograms at start/end             kilo… 48-1… <NA> 
 9 start_mes      Mayo endoscopic Score at start of treatment  poin… 0-3   <NA> 
10 start_bss      Bowel symptom score at start                 poin… 0-100 QOL …
# ℹ 23 more rows
# second create a named vector of variable names and variable labels
vec_variables <- df_dictionary |> 
  select(Variable, Details) |> 
  deframe()

vec_variables
                                        pat_id 
                          "Patient Identifier" 
                                     treatment 
                            "Treatment for UC" 
                                    start_date 
                  "Date of start of treatment" 
                                        ethnic 
        "Ethnicity - hispanic or not hispanic" 
                                          race 
                     "Race - one of 7 choices" 
                                           dob 
                               "date of birth" 
                                      start_bp 
"blood pressure at start - systolic/diastolic" 
                                pre/post_wt_kg 
            "weight in kilograms at start/end" 
                                     start_mes 
 "Mayo endoscopic Score at start of treatment" 
                                     start_bss 
                "Bowel symptom score at start" 
                                     start_abd 
            "Abdominal symptom score at start" 
                                     start_sys 
             "Systemic symptom score at start" 
                                  start_coping 
                       "Coping score at start" 
                                     start_emo 
            "Emotional symptom score at start" 
                                      start_dl 
       "Impact on Daily living score at start" 
                                     start_wbc 
    "White blood cell count in blood at start" 
                                     start_plt 
            "Platelet count in blood at start" 
                                      start_na 
              "Sodium level in serum at start" 
                                       start_k 
           "Potassium level in serum at start" 
                                     end_month 
                          "Month of end visit" 
                                       end_day 
                            "Day of end visit" 
                                      end_year 
                           "Year of end visit" 
                                       end_mes 
   "Mayo endoscopic Score at end of treatment" 
                                       end_bss 
                  "Bowel symptom score at end" 
                                       end_abd 
              "Abdominal symptom score at end" 
                                       end_sys 
               "Systemic symptom score at end" 
                                    end_coping 
                         "Coping score at end" 
                                       end_emo 
              "Emotional symptom score at end" 
                                        end_dl 
         "Impact on Daily living score at end" 
                                       end_wbc 
      "White blood cell count in blood at end" 
                                       end_plt 
              "Platelet count in blood at end" 
                                        end_na 
                "Sodium level in serum at end" 
                                         end_k 
             "Potassium level in serum at end" 
# assign labels to the data set
df_clean <- df_raw |>
  janitor::clean_names() |> 
  janitor::remove_empty(which = c("rows", "cols")) |> 
  mutate(
    ethnic_clean = case_when(
      ethnic %in%  c("hispanic", "Hispanic", "hispamnic") ~ "hispanic",
      ethnic %in%  c("NOT hispanic", "not hispanic") ~ "not hispanic",
    ) |> fct_infreq(),
    end_na_clean = na_if(end_na, -99),
    end_emo_clean = na_if(end_emo, "not done") |> as.numeric(),
    start_na_clean = parse_number(start_na),
    treatment = fct_relevel(treatment, "upa", "uste", "oza")
    ) |>  
  separate_wider_delim(start_bp, delim ="/", names = c("bp_systolic", "bp_diastolic"), cols_remove = FALSE) |> 
  mutate(across(c(bp_systolic, bp_diastolic), as.numeric)) |> 
  # assign labels to all variables from the codebook
  labelled::set_variable_labels(!!!vec_variables, .strict = FALSE) |> 
  # assign labels to new derived variables that did not exist in the codebook
  labelled::set_variable_labels(
    ethnic_clean = "Ethnicity",
    start_na_clean = "Sodium level in serum at start",
    end_na_clean = "Sodium level in serum at end",
    end_emo_clean = "Emotional symptom score at end",
    bp_systolic = "Systolic blood pressure",
    bp_diastolic = "Diastolic blood pressure"
  )
# view entire data set
df_clean |> 
  select(pat_id, start_na, start_na_clean, pre_post_wt_kg, start_emo) |> 
  view()


🤔 Why doesn’t pre_post_wt_kg have a label? Because janitor::clean_names() renamed pre/post_wt_kg to pre_post_wt_kg, the name in the dictionary no longer matches, and .strict = FALSE lets set_variable_labels() skip it silently.

# view structure of data frame
df_clean |>
   select(pat_id, start_na, start_na_clean, pre_post_wt_kg, start_emo) |> 
   str()
tibble [30 × 5] (S3: tbl_df/tbl/data.frame)
 $ pat_id        : chr [1:30] "001" "002" "003" "004" ...
  ..- attr(*, "label")= chr "Patient Identifier"
 $ start_na      : chr [1:30] "137mmol/L" "142mmol/L" "140mmol/L" "139mmol/L" ...
  ..- attr(*, "label")= chr "Sodium level in serum at start"
 $ start_na_clean: num [1:30] 137 142 140 139 144 145 142 138 140 137 ...
  ..- attr(*, "label")= chr "Sodium level in serum at start"
 $ pre_post_wt_kg: chr [1:30] "84/82" "77/77" "74/75" "66/65" ...
 $ start_emo     : num [1:30] 79.3 58.6 32.2 75.6 74.3 ...
  ..- attr(*, "label")= chr "Emotional symptom score at start"

Stage 2
Wrangling your Data

Where are We Now?

  • We have done DEV (Data Exploration and Validation) - with more validation to be done. Data validation is a continuous process, right up to Data Lock.
  • We discussed the importance of preventing data errors at the source - data collection and entry (preferably using a tool with data validation at entry - like REDCap)
  • We did Stage 1 cleaning - cleaned variable names, removed empty columns/rows, fixed variable types/classes (characters in numeric variables, recoding factors, date madness), addressed missingness and violations of tidy data principles (separating values), and added meaningful variable labels.
  • Stage 1 is required in nearly all projects; Stage 2 data cleaning is frequently needed, but not in every project.

Stage 2 Data Cleaning

  • We will cover these frequent but optional topics:
    • Restructuring data as Long or Wide format
    • Thickening or padding Longitudinal Data
    • Joining multiple datasets

Restructuring data to Long or Wide format

Wide Format - One Row per Patient

# A tibble: 6 × 8
  pat_id treatment start_mes end_mes start_bss end_bss start_emo end_emo
  <chr>  <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>
1 001    upa               3       0      75.9    9.85      79.3    39.8
2 002    uste              2       1      54.7   32.7       58.6    49.4
3 003    oza               1       1      28.5   28.5       32.2    35.9
4 004    upa               3       1      77.2   33.2       75.6    50.2
5 005    oza               3       2      78.8   56.8       74.3    65.0
6 006    uste              2       1      54.4   32.4       45.3    36.0

Long Format - One Row per Measurement

# A tibble: 9 × 4
  pat_id treatment measure   score
  <chr>  <chr>     <chr>     <dbl>
1 001    upa       start_mes  3   
2 001    upa       end_mes    0   
3 001    upa       start_bss 75.9 
4 001    upa       end_bss    9.85
5 001    upa       start_emo 79.3 
6 001    upa       end_emo   39.8 
7 002    uste      start_mes  2   
8 002    uste      end_mes    1   
9 002    uste      start_bss 54.7 

The Unit of Analysis - Wide by Patient

  • We may want to do an analysis by patient, as each patient may (or may not) have the outcome. If we have multiple observations (data points) on each patient, this leads to wide data, with one row per patient.
# A tibble: 6 × 8
  pat_id treatment start_mes end_mes start_bss end_bss start_emo end_emo
  <chr>  <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>
1 001    upa               3       0      75.9    9.85      79.3    39.8
2 002    uste              2       1      54.7   32.7       58.6    49.4
3 003    oza               1       1      28.5   28.5       32.2    35.9
4 004    upa               3       1      77.2   33.2       75.6    50.2
5 005    oza               3       2      78.8   56.8       74.3    65.0
6 006    uste              2       1      54.4   32.4       45.3    36.0

The Unit of Analysis - Long by Visit

  • We are often interested in the change in an outcome over time.
  • To make this work, we need one row per measurement of the outcome. This leads to long data, with multiple visits and measurements for each patient.
# A tibble: 180 × 4
   pat_id treatment measure   score
   <chr>  <chr>     <chr>     <dbl>
 1 001    upa       start_mes  3   
 2 001    upa       end_mes    0   
 3 001    upa       start_bss 75.9 
 4 001    upa       end_bss    9.85
 5 001    upa       start_emo 79.3 
 6 001    upa       end_emo   39.8 
 7 002    uste      start_mes  2   
 8 002    uste      end_mes    1   
 9 002    uste      start_bss 54.7 
10 002    uste      end_bss   32.7 
# ℹ 170 more rows

Deciding on the Unit of Analysis

  • This Unit of Analysis usually depends on the Question we ask

Is the Unit of Analysis the Patient?

  • Did the patient die?
  • Did the patient have the outcome of colectomy?
  • Did the patient reach disease remission?

Is the Unit of Analysis the Visit/Encounter?

  • Often these are within-patient outcomes
    • Did the C-reactive protein improve from Week 0 to Week 8?
    • Did the sickle cell crises/year decrease after CRISPR gene Rx?
    • Did the endoscopic score decrease on treatment vs placebo?

Deciding on the Unit of Analysis

  • You will most often use long data, and this data structure allows you to look at multiple predictors and outcomes, like blood pressure, a PHQ-9 depression questionnaire, and a hemoglobin measurement.
  • Depending on the analysis question, you may want to use wide data, and analyze by patient (usually with dichotomous outcomes).

Deciding on the Unit of Analysis

  • For inpatient data, you may have multiple measurements in the same visit or day, so we need to decide a priori how to handle multiple observations of the same type (e.g., vitals q6h) in the same day (a sketch follows this list):
    • use the 0600 observation each day?
    • use the daily average for SBP and DBP?
    • use the max values each day?
  • Sometimes, you may want to do analysis on both long and wide data in the same project.
  • It is often helpful to be able to pivot your data between long and wide structures.
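
A minimal sketch of the daily-average option, assuming long vitals data with hypothetical pat_id, obs_time, sbp, and dbp columns:

library(dplyr)

# vitals: hypothetical long data (pat_id, obs_time, sbp, dbp)
vitals |>
  mutate(obs_day = as.Date(obs_time)) |>   # collapse time stamps to the day
  group_by(pat_id, obs_day) |>
  summarize(
    sbp = mean(sbp, na.rm = TRUE),         # daily average systolic
    dbp = mean(dbp, na.rm = TRUE),         # daily average diastolic
    .groups = "drop"
  )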

Reshaping your data with tidyr

  • We often enter data by patient
  • Spreadsheets encourage us to enter longitudinal data as long rows (per patient)
  • We end up with wide, rather than tall data
# A tibble: 6 × 8
  pat_id treatment start_mes end_mes start_bss end_bss start_emo end_emo
  <chr>  <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>
1 001    upa               3       0      75.9    9.85      79.3    39.8
2 002    uste              2       1      54.7   32.7       58.6    49.4
3 003    oza               1       1      28.5   28.5       32.2    35.9
4 004    upa               3       1      77.2   33.2       75.6    50.2
5 005    oza               3       2      78.8   56.8       74.3    65.0
6 006    uste              2       1      54.4   32.4       45.3    36.0
# A tibble: 180 × 4
   pat_id treatment measure   score
   <chr>  <chr>     <chr>     <dbl>
 1 001    upa       start_mes  3   
 2 001    upa       end_mes    0   
 3 001    upa       start_bss 75.9 
 4 001    upa       end_bss    9.85
 5 001    upa       start_emo 79.3 
 6 001    upa       end_emo   39.8 
 7 002    uste      start_mes  2   
 8 002    uste      end_mes    1   
 9 002    uste      start_bss 54.7 
10 002    uste      end_bss   32.7 
# ℹ 170 more rows

Reshaping your data with tidyr

  • R and most R functions are vectorized to handle tall data
  • One small observation per row
  • Most analyses in R are easier with tall data
# A tibble: 180 × 4
   pat_id treatment measure   score
   <chr>  <chr>     <chr>     <dbl>
 1 001    upa       start_mes  3   
 2 001    upa       end_mes    0   
 3 001    upa       start_bss 75.9 
 4 001    upa       end_bss    9.85
 5 001    upa       start_emo 79.3 
 6 001    upa       end_emo   39.8 
 7 002    uste      start_mes  2   
 8 002    uste      end_mes    1   
 9 002    uste      start_bss 54.7 
10 002    uste      end_bss   32.7 
# ℹ 170 more rows

Pivoting Longer (common)

  • We need to ‘pivot’ data from wide to tall on the regular
  • This “lengthens” data, increasing the number of rows, and decreasing the number of columns
  • We will be looking at Visit Dates (Start vs End) and Measures

Pivoting Longer

  • Arguments: data, cols, names_to, values_to, and many optional arguments
  • Details from the tidyverse help page are here
  • data = your dataframe/tibble - you can pipe this in
  • cols = columns to pivot, as a vector of names, or by number, or selected with tidyselect functions
  • names_to = A character vector specifying the new column or columns to create from the information stored in the column names of data specified by cols.
  • values_to = A string specifying the name of the column to create from the data stored in cell values.

Pivoting Longer (Example)

Let’s start with the wide version (selected columns from messy_uc)

# A tibble: 30 × 8
   pat_id treatment start_mes end_mes start_bss end_bss start_emo end_emo
   <chr>  <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>
 1 001    upa               3       0      75.9    9.85      79.3    39.8
 2 002    uste              2       1      54.7   32.7       58.6    49.4
 3 003    oza               1       1      28.5   28.5       32.2    35.9
 4 004    upa               3       1      77.2   33.2       75.6    50.2
 5 005    oza               3       2      78.8   56.8       74.3    65.0
 6 006    uste              2       1      54.4   32.4       45.3    36.0
 7 007    uste              3       2      79.7   57.7       66.6    57.8
 8 008    oza               3       2      75.0   53.0       74.5    62.0
 9 009    upa               2       0      53.1    9.06      50.0    27.8
10 010    oza               3       1      78.9   34.9       72.5    45.9
# ℹ 20 more rows
  • Note that there are 30 rows, one per patient, with 6 measured quantities for each patient.

Pivoting Longer (Example)

This is the tall version we want to end up with.

# A tibble: 180 × 4
   pat_id treatment measure   score
   <chr>  <chr>     <chr>     <dbl>
 1 001    upa       start_mes  3   
 2 001    upa       end_mes    0   
 3 001    upa       start_bss 75.9 
 4 001    upa       end_bss    9.85
 5 001    upa       start_emo 79.3 
 6 001    upa       end_emo   39.8 
 7 002    uste      start_mes  2   
 8 002    uste      end_mes    1   
 9 002    uste      start_bss 54.7 
10 002    uste      end_bss   32.7 
# ℹ 170 more rows
  • Note that there are now 180 rows (30 × 6), with one row per observation measure.

Doing the pivot_longer()

What values do we want for these key arguments in order to pivot_longer?

  • cols (which columns to pivot?)
  • names_to (variable to store the names)
  • values_to (variable to store the values)
# A tibble: 30 × 8
   pat_id treatment start_mes end_mes start_bss end_bss start_emo end_emo
   <chr>  <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>
 1 001    upa               3       0      75.9    9.85      79.3    39.8
 2 002    uste              2       1      54.7   32.7       58.6    49.4
 3 003    oza               1       1      28.5   28.5       32.2    35.9
 4 004    upa               3       1      77.2   33.2       75.6    50.2
 5 005    oza               3       2      78.8   56.8       74.3    65.0
 6 006    uste              2       1      54.4   32.4       45.3    36.0
 7 007    uste              3       2      79.7   57.7       66.6    57.8
 8 008    oza               3       2      75.0   53.0       74.5    62.0
 9 009    upa               2       0      53.1    9.06      50.0    27.8
10 010    oza               3       1      78.9   34.9       72.5    45.9
# ℹ 20 more rows

Pivoting Longer In Action

# A tibble: 30 × 8
   pat_id treatment start_mes end_mes start_bss end_bss start_emo end_emo
   <chr>  <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>
 1 001    upa               3       0      75.9    9.85      79.3    39.8
 2 002    uste              2       1      54.7   32.7       58.6    49.4
 3 003    oza               1       1      28.5   28.5       32.2    35.9
 4 004    upa               3       1      77.2   33.2       75.6    50.2
 5 005    oza               3       2      78.8   56.8       74.3    65.0
 6 006    uste              2       1      54.4   32.4       45.3    36.0
 7 007    uste              3       2      79.7   57.7       66.6    57.8
 8 008    oza               3       2      75.0   53.0       74.5    62.0
 9 009    upa               2       0      53.1    9.06      50.0    27.8
10 010    oza               3       1      78.9   34.9       72.5    45.9
# ℹ 20 more rows
wide |>                                # 1
  pivot_longer(
    cols = "start_mes":"end_emo",      # 2
    names_to = "measure",              # 3
    values_to = "score"                # 4
  )

1. Start with wide data, and pivot_longer
2. Which columns to pivot (names in quotes); could also use 3:8
3. Which column (in quotes) the pivoted variable names should go into
4. Which column (in quotes) the pivoted values should go into
# A tibble: 180 × 4
   pat_id treatment measure   score
   <chr>  <chr>     <chr>     <dbl>
 1 001    upa       start_mes  3   
 2 001    upa       end_mes    0   
 3 001    upa       start_bss 75.9 
 4 001    upa       end_bss    9.85
 5 001    upa       start_emo 79.3 
 6 001    upa       end_emo   39.8 
 7 002    uste      start_mes  2   
 8 002    uste      end_mes    1   
 9 002    uste      start_bss 54.7 
10 002    uste      end_bss   32.7 
# ℹ 170 more rows
  • Does this make sense so far?
  • Zoom reaction: “thumbs up” emoji 👍 if yes
    • “raised hand” emoji ✋ if puzzled/questions

One Minor Issue - Separation of measure

  • The "measure" column combines a timepoint and the measure
  • It needs to be separated.
  • You already know how to use separate()
  • Arguments
    • col
    • sep
    • into
tall |> 
  separate(col = "measure",
           sep = "_",
           into = c("timept", "measure")
           ) 
# A tibble: 180 × 5
   pat_id treatment timept measure score
   <chr>  <chr>     <chr>  <chr>   <dbl>
 1 001    upa       start  mes      3   
 2 001    upa       end    mes      0   
 3 001    upa       start  bss     75.9 
 4 001    upa       end    bss      9.85
 5 001    upa       start  emo     79.3 
 6 001    upa       end    emo     39.8 
 7 002    uste      start  mes      2   
 8 002    uste      end    mes      1   
 9 002    uste      start  bss     54.7 
10 002    uste      end    bss     32.7 
# ℹ 170 more rows
  • You can do this within pivot_longer with just one more argument
    (if you read all of the pivot_longer documentation)
wide |> 
  pivot_longer(cols = 3:8,
    names_to = c("timept", "measure"),
    names_sep = "_",
    values_to = "score")
# A tibble: 180 × 5
   pat_id treatment timept measure score
   <chr>  <chr>     <chr>  <chr>   <dbl>
 1 001    upa       start  mes      3   
 2 001    upa       end    mes      0   
 3 001    upa       start  bss     75.9 
 4 001    upa       end    bss      9.85
 5 001    upa       start  emo     79.3 
 6 001    upa       end    emo     39.8 
 7 002    uste      start  mes      2   
 8 002    uste      end    mes      1   
 9 002    uste      start  bss     54.7 
10 002    uste      end    bss     32.7 
# ℹ 170 more rows

Pivoting Longer

  • Your Turn with endo_data
  • Measurements of Trans-Epithelial Electrical Resistance (TEER, the inverse of leakiness) in biopsies of 3 segments of intestine.
  • This could be affected by portal hypertension in patients with liver cirrhosis
  • Let’s find out!

Doing this At Home? Data load

  • Here is the code to load the data if you are doing this on a local computer.
endo_data <- tibble::tribble(
  ~pat_id, ~portal_htn, ~duod_teer, ~ileal_teer, ~colon_teer,
  001, 1, 4.33, 14.57, 16.23,
  002, 0, 11.67, 15.99, 18.97,
  003, 1, 4.12, 13.77, 15.22,
  004, 1, 4.62, 16.37, 18.12,
  005, 0, 12.43, 15.84, 19.04,
  006, 0, 13.05, 16.23, 18.81,
  007, 0, 11.88, 15.72, 18.31,
  008, 1, 4.87, 16.59, 18.77,
  009, 1, 4.23, 15.04, 16.87,
  010, 0, 12.77, 16.73, 19.12
)
endo_data
# A tibble: 10 × 5
   pat_id portal_htn duod_teer ileal_teer colon_teer
    <dbl>      <dbl>     <dbl>      <dbl>      <dbl>
 1      1          1      4.33       14.6       16.2
 2      2          0     11.7        16.0       19.0
 3      3          1      4.12       13.8       15.2
 4      4          1      4.62       16.4       18.1
 5      5          0     12.4        15.8       19.0
 6      6          0     13.0        16.2       18.8
 7      7          0     11.9        15.7       18.3
 8      8          1      4.87       16.6       18.8
 9      9          1      4.23       15.0       16.9
10     10          0     12.8        16.7       19.1

Exercise PH1


Complete Data Cleaning Fundamentals Exercise PH1.

  • Do This in Posit Cloud

  • If you have the exercise done correctly, click on the Reactions tab in Zoom, and click to put the “thumbs up” emoji 👍 on your screen.

  • If you are having trouble, click on the Reactions tab in Zoom, and click to put the “raised hand” emoji ✋ on your screen.

–> Take me to the exercise Solution <–


03:00

Pivoting Longer with endo_data

# A tibble: 10 × 5
   pat_id portal_htn duod_teer ileal_teer colon_teer
    <dbl>      <dbl>     <dbl>      <dbl>      <dbl>
 1      1          1      4.33       14.6       16.2
 2      2          0     11.7        16.0       19.0
 3      3          1      4.12       13.8       15.2
 4      4          1      4.62       16.4       18.1
 5      5          0     12.4        15.8       19.0
 6      6          0     13.0        16.2       18.8
 7      7          0     11.9        15.7       18.3
 8      8          1      4.87       16.6       18.8
 9      9          1      4.23       15.0       16.9
10     10          0     12.8        16.7       19.1
  • What values do you want to use for these arguments to pivot_longer:
    • cols
    • names_pattern = "(.+)_teer"
    • names_to
    • values_to
  • Note that we are giving you the correct value for names_pattern, which captures what we want to keep - the characters of the name (of whatever length) before "_teer"
  • Fill in the blanks to pivot this dataset to tall format, with columns for the intestinal location and the teer value.
endo_data |> 
  pivot_longer(
    cols = ___ ,
    names_pattern = "(.+)_teer",
    names_to =  ___ ,
    values_to = ___
  )
  • Here is the completed solution:
endo_data |> 
  pivot_longer(
    cols = "duod_teer":"colon_teer",
    names_pattern = "(.+)_teer",
    names_to = c("location"),
    values_to = "teer"
  )
  • Run the code, and look at the resulting table.
# A tibble: 30 × 4
   pat_id portal_htn location  teer
    <dbl>      <dbl> <chr>    <dbl>
 1      1          1 duod      4.33
 2      1          1 ileal    14.6 
 3      1          1 colon    16.2 
 4      2          0 duod     11.7 
 5      2          0 ileal    16.0 
 6      2          0 colon    19.0 
 7      3          1 duod      4.12
 8      3          1 ileal    13.8 
 9      3          1 colon    15.2 
10      4          1 duod      4.62
# ℹ 20 more rows
  • Do you think that portal hypertension has an effect on TEER and (its inverse) epithelial leakiness?
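  • To eyeball an answer, here is a minimal sketch comparing mean TEER by group (assuming the pivoted result above was saved as endo_tall, a name we introduce here):
library(dplyr)

endo_tall |> 
  group_by(portal_htn, location) |> 
  summarize(mean_teer = mean(teer), .groups = "drop")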

Pivoting Wider

  • Wide data is less common, but sometimes needed for per-patient analysis
  • Here we will convert the tall version of our selected messy_uc data back to wide.
  • This is what the tall data look like
# A tibble: 180 × 4
   pat_id treatment measure   score
   <chr>  <chr>     <chr>     <dbl>
 1 001    upa       start_mes  3   
 2 001    upa       end_mes    0   
 3 001    upa       start_bss 75.9 
 4 001    upa       end_bss    9.85
 5 001    upa       start_emo 79.3 
 6 001    upa       end_emo   39.8 
 7 002    uste      start_mes  2   
 8 002    uste      end_mes    1   
 9 002    uste      start_bss 54.7 
10 002    uste      end_bss   32.7 
# ℹ 170 more rows
tall |> 
  pivot_wider(
    id_cols = c(pat_id, treatment), # Variables not pivoted
    names_from = measure, # will become column names
    values_from = score # will become values
  )
# A tibble: 30 × 8
   pat_id treatment start_mes end_mes start_bss end_bss start_emo end_emo
   <chr>  <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>
 1 001    upa               3       0      75.9    9.85      79.3    39.8
 2 002    uste              2       1      54.7   32.7       58.6    49.4
 3 003    oza               1       1      28.5   28.5       32.2    35.9
 4 004    upa               3       1      77.2   33.2       75.6    50.2
 5 005    oza               3       2      78.8   56.8       74.3    65.0
 6 006    uste              2       1      54.4   32.4       45.3    36.0
 7 007    uste              3       2      79.7   57.7       66.6    57.8
 8 008    oza               3       2      75.0   53.0       74.5    62.0
 9 009    upa               2       0      53.1    9.06      50.0    27.8
10 010    oza               3       1      78.9   34.9       72.5    45.9
# ℹ 20 more rows
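  • With one row per patient, per-patient quantities are straightforward to compute; a sketch (assuming the wide result above was saved as wide_uc, a name we introduce here) of the change in Mayo endoscopic score:
library(dplyr)

wide_uc |> 
  mutate(mes_change = end_mes - start_mes) |> # negative values = improvement
  select(pat_id, treatment, mes_change)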

Longitudinal Data

  • Another set of data issues you may face is in data collected over time, when one of two things happen:
    • You want to analyze data by day or month or year, and data in your Electronic Medical Record is collected and time-stamped by the second.
    • You realize that some observations (on weekends) are missing, and you need to fill these dates in as missing, but you really don’t want to do this by hand.
  • The {padr} package can help with these issues.

Thicken Date-Times with padr

# A tibble: 6 × 2
  title                   time_stamp         
  <chr>                   <dttm>             
1 EMS: BACK PAINS/INJURY  2015-12-10 17:40:00
2 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00
3 Fire: GAS-ODOR/LEAK     2015-12-10 17:40:00
4 EMS: CARDIAC EMERGENCY  2015-12-10 17:40:01
5 EMS: DIZZINESS          2015-12-10 17:40:01
6 EMS: HEAD INJURY        2015-12-10 17:40:01
  • The emergency data set in the {padr} package contains > 120K emergency calls from Montgomery County, PA over a period of ~ 11 months.
  • Each call has a title and a timestamp
  • We want to know in which months the emergency department should order an extra case of Narcan (hypothesis: summer months?).

Thickening Time to a Usable Level

  • The thicken function adds a column to a data frame that is of a higher interval than the original variable.
  • The intervals for {padr} are year, quarter, month, week, day, hour, min, and sec.
  • The variable time_stamp has the interval of seconds
  • We can thicken the data to the time interval we need.
  • Then we can count events by a usable unit of time
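  • A minimal setup sketch for the examples that follow (the emergency data ships with {padr}):
# install.packages(c("padr", "dplyr", "stringr")) # if needed
library(padr)    # thicken(), pad(), and the emergency data
library(dplyr)   # group_by(), summarize(), select()
library(stringr) # str_detect(), used below to find overdose calls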
emergency |> 
  head()
# A tibble: 6 × 6
    lat   lng   zip title                   time_stamp          twp             
  <dbl> <dbl> <int> <chr>                   <dttm>              <chr>           
1  40.3 -75.6 19525 EMS: BACK PAINS/INJURY  2015-12-10 17:40:00 NEW HANOVER     
2  40.3 -75.3 19446 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSH…
3  40.1 -75.4 19401 Fire: GAS-ODOR/LEAK     2015-12-10 17:40:00 NORRISTOWN      
4  40.1 -75.3 19401 EMS: CARDIAC EMERGENCY  2015-12-10 17:40:01 NORRISTOWN      
5  40.3 -75.6    NA EMS: DIZZINESS          2015-12-10 17:40:01 LOWER POTTSGROVE
6  40.3 -75.3 19446 EMS: HEAD INJURY        2015-12-10 17:40:01 LANSDALE        
  • We will thicken to month
emergency |> 
  thicken('month') |> 
  head() |> 
  select(-lat, -lng, -zip)
# A tibble: 6 × 4
  title                   time_stamp          twp               time_stamp_month
  <chr>                   <dttm>              <chr>             <date>          
1 EMS: BACK PAINS/INJURY  2015-12-10 17:40:00 NEW HANOVER       2015-12-01      
2 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSHIP 2015-12-01      
3 Fire: GAS-ODOR/LEAK     2015-12-10 17:40:00 NORRISTOWN        2015-12-01      
4 EMS: CARDIAC EMERGENCY  2015-12-10 17:40:01 NORRISTOWN        2015-12-01      
5 EMS: DIZZINESS          2015-12-10 17:40:01 LOWER POTTSGROVE  2015-12-01      
6 EMS: HEAD INJURY        2015-12-10 17:40:01 LANSDALE          2015-12-01      
  • We will thicken to week
emergency |> 
  thicken('week') |> 
  head() |> 
  select(-lat, -lng, -zip)
# A tibble: 6 × 4
  title                   time_stamp          twp               time_stamp_week
  <chr>                   <dttm>              <chr>             <date>         
1 EMS: BACK PAINS/INJURY  2015-12-10 17:40:00 NEW HANOVER       2015-12-06     
2 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSHIP 2015-12-06     
3 Fire: GAS-ODOR/LEAK     2015-12-10 17:40:00 NORRISTOWN        2015-12-06     
4 EMS: CARDIAC EMERGENCY  2015-12-10 17:40:01 NORRISTOWN        2015-12-06     
5 EMS: DIZZINESS          2015-12-10 17:40:01 LOWER POTTSGROVE  2015-12-06     
6 EMS: HEAD INJURY        2015-12-10 17:40:01 LANSDALE          2015-12-06     

Thickening Time for a Monthly Plot

  • We want to thicken this (time stamped) data to month, then select and count overdoses
  • This will set up the monthly overdose plot we want.
  • We will thicken to month
  • Then count overdoses by month
emergency |> 
  thicken('month') |> 
  group_by(time_stamp_month) |> 
  summarize(overdoses = sum(str_detect(title, "OVERDOSE"))) |> 
  select(time_stamp_month, overdoses)
  • This lets us count events like overdoses by month with time_stamp_month.
# A tibble: 11 × 2
   time_stamp_month overdoses
   <date>               <int>
 1 2015-12-01              91
 2 2016-01-01             135
 3 2016-02-01             141
 4 2016-03-01             161
 5 2016-04-01             124
 6 2016-05-01             149
 7 2016-06-01             140
 8 2016-07-01             147
 9 2016-08-01             137
10 2016-09-01             171
11 2016-10-01              95

  • When would you order extra cases of Narcan?
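  • To draw the monthly plot itself, a minimal sketch (assuming the monthly counts above were saved as monthly_od, a name we introduce here, and that {ggplot2} is installed):
library(ggplot2)

monthly_od |> 
  ggplot(aes(x = time_stamp_month, y = overdoses)) +
  geom_col() +
  labs(x = "Month", y = "Overdose calls",
       title = "Monthly overdose calls, Montgomery County, PA")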

Padding unobserved dates (weekends?)

  • The pad() function allows you to fill in missing intervals.
  • As an example, my hospital only runs fecal calprotectin tests on weekdays.
  • This can lead to weird discontinuities in data over a weekend (Dec 3-4).
  • No observations on weekend days/holidays.
# A tibble: 14 × 3
   pat_id date         fcp
   <chr>  <date>     <dbl>
 1 001    2022-12-01  1574
 2 001    2022-12-02  1323
 3 001    2022-12-05   673
 4 001    2022-12-06   314
 5 001    2022-12-07   168
 6 002    2022-11-30  1393
 7 002    2022-12-01  1014
 8 002    2022-12-02   812
 9 002    2022-12-05   247
10 002    2022-12-06   118
11 003    2022-12-02   987
12 003    2022-12-05   438
13 003    2022-12-06   312
14 003    2022-12-08   194

Padding Unobserved Times

  • We can fill in (pad) the unobserved weekend days with the pad() function.
fcp |> 
  pad(group = "pat_id") |> # adds a row for each missing day, by patient
  print(n = 12)
# A tibble: 21 × 3
   pat_id date         fcp
   <chr>  <date>     <dbl>
 1 001    2022-12-01  1574
 2 001    2022-12-02  1323
 3 001    2022-12-03    NA
 4 001    2022-12-04    NA
 5 001    2022-12-05   673
 6 001    2022-12-06   314
 7 001    2022-12-07   168
 8 002    2022-11-30  1393
 9 002    2022-12-01  1014
10 002    2022-12-02   812
11 002    2022-12-03    NA
12 002    2022-12-04    NA
# ℹ 9 more rows
  • New observations are created on the missing dates
  • NAs are filled in for the missing FCP values, one for each missing day within each group (pat_id)
  • pad() also fills in the pat_id for each new row
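  • After padding, you may want to replace the NAs; whether that is appropriate is a judgment call. A sketch carrying the last observation forward within each patient:
library(dplyr)
library(padr)
library(tidyr)

fcp |> 
  pad(group = "pat_id") |> 
  group_by(pat_id) |> 
  fill(fcp, .direction = "down") |> # last observation carried forward
  ungroup()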

Joining Data

  • Another data issue you will often face is that you have two interesting data sets, and that they would be more interesting if you could link the data in one to the data in the other.

Better Together:
Chocolate and Peanut Butter
(Datasets)

Joins of data from different sources

  • We often collect data from different sources that we later want to join together for analysis
    • Data from your local Electronic Medical Record
    • Data from the CDC
    • Data from the US Census
  • External data can illuminate our understanding of our local patient data

Local Demographics with CDC SVI data

  • We have 2 datasets: one with local demographics and census tract from our EMR, and one from the CDC with Social Vulnerability Index (SVI) values by census tract
  • We want to know if the SVI for the neighborhood of each patient influences health outcomes
  • We need to left_join these datasets together by matching on the Census Tract

  • What is the common uniqid/key?
  • Local EMR data (demo)
# A tibble: 9 × 4
  pat_id name                 htn census_tract
  <chr>  <chr>              <dbl>        <dbl>
1 001    Arthur Blankenship     0  26161404400
2 002    Britney Jonas          0  26161405100
3 003    Sally Davis            1  26161402100
4 004    Al Jones               0  26161403200
5 005    Gary Hamill            1  26161405200
6 006    Ken Bartoletti         0  26161404500
7 007    Ike Gerhold            0  26161405600
8 008    Tatiana Grant          0  26161404300
9 009    Antoine Delacroix      1  26161405500
  • CDC SVI data (cdc)
# A tibble: 9 × 2
  census_tract   svi
         <dbl> <dbl>
1  26161404400  0.12
2  26161405100  0.67
3  26161402100  0.43
4  26161403200  0.07
5  26161405200  0.71
6  26161404500  0.23
7  26161405600  0.27
8  26161404300  0.21
9  26161405500  0.62
  • Replace the generic arguments to the left_join function: join the demographic data (demo) on the left to add the Social Vulnerability Index (svi) from the cdc dataset on the right (RHS). Left join by census tract.
left_join(dataset_x, dataset_y, by = "key")
left_join(demo, cdc, by = "census_tract")
  • Now that the join is complete, you can ask your question:
  • Is there an association between social vulnerability and hypertension?
# A tibble: 9 × 5
  pat_id name                 htn census_tract   svi
  <chr>  <chr>              <dbl>        <dbl> <dbl>
1 001    Arthur Blankenship     0  26161404400  0.12
2 002    Britney Jonas          0  26161405100  0.67
3 003    Sally Davis            1  26161402100  0.43
4 004    Al Jones               0  26161403200  0.07
5 005    Gary Hamill            1  26161405200  0.71
6 006    Ken Bartoletti         0  26161404500  0.23
7 007    Ike Gerhold            0  26161405600  0.27
8 008    Tatiana Grant          0  26161404300  0.21
9 009    Antoine Delacroix      1  26161405500  0.62
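  • A quick look at that question; a minimal sketch comparing mean SVI by hypertension status (assuming the joined result above was saved as demo_svi, a name we introduce here):
library(dplyr)

demo_svi |> 
  group_by(htn) |> 
  summarize(mean_svi = mean(svi), n = n())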

Patient Demographics with Lab results (Your Turn to Join)

  • We have some basic Patient Demographics in one table (demo)
# A tibble: 9 × 3
  pat_id name                 age
  <chr>  <chr>              <dbl>
1 001    Arthur Blankenship    67
2 002    Britney Jonas         23
3 003    Sally Davis           63
4 004    Al Jones              44
5 005    Gary Hamill           38
6 006    Ken Bartoletti        33
7 007    Ike Gerhold           52
8 008    Tatiana Grant         42
9 009    Antoine Delacroix     27

and potassium and creatinine levels in two other tables (pot and cr)

# A tibble: 5 × 2
  pat_id     k
  <chr>  <dbl>
1 001      3.2
2 002      3.7
3 003      4.2
4 004      4.4
5 005      4.1
# A tibble: 5 × 2
  pat_id    cr
  <chr>  <dbl>
1 001      0.2
2 002      0.5
3 003      0.9
4 004      1.5
5 005      0.7

Need to Load the Data?

If you are trying this on your local computer, copy the code below to get the data into your computer.

demo <- tibble::tribble(
  ~pat_id, ~name, ~age,
  '001', "Arthur Blankenship", 67,
  '002', "Britney Jonas", 23,
  '003', "Sally Davis", 63,
  '004', "Al Jones", 44,
  '005', "Gary Hamill", 38,
  '006', "Ken Bartoletti", 33,
  '007', "Ike Gerhold", 52,
  '008', "Tatiana Grant", 42,
  '009', "Antoine Delacroix", 27,
)

pot <- tibble::tribble(
  ~pat_id, ~k,
  '001', 3.2,
  '002', 3.7,
  '003', 4.2,
  '004', 4.4,
  '005', 4.1,
  '006', 4.0,
  '007', 3.6,
  '008', 4.2,
  '009', 4.9
)

cr <- tibble::tribble(
  ~pat_id, ~cr,
  '001', 0.2,
  '002', 0.5,
  '003', 0.9,
  '004', 1.5,
  '005', 0.7,
  '006', 0.9,
  '007', 0.7,
  '008', 1.0,
  '009', 1.7
)

Your Turn to Join

  • We want to join the correct labs (9 rows each from the pot and cr datasets) to the correct patients.
  • The unique identifier (called the uniqid or key or recordID) is pat_id.
    • It only occurs once for each patient/row
    • It appears in each table we want to join
    • The pat_id is of character type in each table (a common pitfall: if one is character and one is numeric, the IDs look the same but will not match; see the sketch after this list)
  • We want to start with demographics, then add datasets that match to the right.
  • We will use demo as our base dataset on the left hand side (LHS), and first join the potassium (pot) results (RHS)
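  • If the key types ever disagree, align them before joining; a minimal sketch (pot_numeric is a hypothetical version of pot with numeric IDs):
library(dplyr)
library(stringr)

pot_fixed <- pot_numeric |> 
  mutate(pat_id = str_pad(as.character(pat_id), width = 3, pad = "0")) # 1 -> "001"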

What the Left Join Looks Like

  • This is a mutating join: new variables from y are created/added to the LHS (x).


Exercise PH2


Complete the Data Cleaning Fundamentals Exercise PH2.

  • Do This in Posit Cloud

  • If you have the exercise done correctly, click on the Reactions tab in Zoom, and click to put the “thumbs up” emoji 👍 on your screen.

  • If you are having trouble, click on the Reactions tab in Zoom, and click to put the “raised hand” emoji ✋ on your screen.

--> Take me to the exercise Solution <--

03:00

Your Turn to Join

  • Joining demo to pot with a left_join
  • left_join(data_x, data_y, by = "uniqid")
  • Replace the generic arguments below with the correct ones to join demo to pot and produce new_data.
new_data <- left_join(data_x, data_y, by = "uniqid")
new_data

Note the syntax for identifying the uniqid on which to do the merge: by = "varname"

new_data <- left_join(demo, pot, by = "pat_id") 
new_data
# A tibble: 9 × 4
  pat_id name                 age     k
  <chr>  <chr>              <dbl> <dbl>
1 001    Arthur Blankenship    67   3.2
2 002    Britney Jonas         23   3.7
3 003    Sally Davis           63   4.2
4 004    Al Jones              44   4.4
5 005    Gary Hamill           38   4.1
6 006    Ken Bartoletti        33   4  
7 007    Ike Gerhold           52   3.6
8 008    Tatiana Grant         42   4.2
9 009    Antoine Delacroix     27   4.9

Now add Creatinine (cr) to new_data

  • Joining new_data and cr with a left_join
  • left_join(data_x, data_y, by = "uniqid")
  • Replace the generic arguments with the correct ones to join
    new_data and cr and produce new_data2.
new_data2 <- left_join(data_x, data_y, by = "uniqid")
new_data2
new_data2 <- left_join(new_data, cr, by = "pat_id")
new_data2
# A tibble: 9 × 5
  pat_id name                 age     k    cr
  <chr>  <chr>              <dbl> <dbl> <dbl>
1 001    Arthur Blankenship    67   3.2   0.2
2 002    Britney Jonas         23   3.7   0.5
3 003    Sally Davis           63   4.2   0.9
4 004    Al Jones              44   4.4   1.5
5 005    Gary Hamill           38   4.1   0.7
6 006    Ken Bartoletti        33   4     0.9
7 007    Ike Gerhold           52   3.6   0.7
8 008    Tatiana Grant         42   4.2   1  
9 009    Antoine Delacroix     27   4.9   1.7
  • Note the elevated creatinine values: Al has HTN and DM2
  • Antoine has early-stage FSGS

Workhorse Joins

  • left_join() is your workhorse. Start with patient identifiers/uniqid on the left, and add data to the right side.
  • Sometimes you will need to wrangle/process incoming data first, then pipe it into right_join() to attach it to the patient demographics (see the sketch below)
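  • A sketch of that pattern (incoming_labs is a hypothetical raw lab feed): process first, then right_join() so that the demographics table defines the rows kept:
library(dplyr)

incoming_labs |> 
  filter(!is.na(result)) |>       # wrangle/process the incoming data first
  right_join(demo, by = "pat_id") # demo (the RHS) defines which rows are kept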

Fancy Joins

  • There are multiple kinds of fancy joins (semi_join, anti_join, inner_join, full_join, union, intersect, setdiff) which come in handy once in a while.
  • The many kinds of joins are all well explained here
  • semi_join: subset your data (x), keeping only the patients who were hospitalized (i.e., rows of x with a match in y = hospitalized).
  • anti_join: subset your data (x), keeping only the patients who are NOT dead (i.e., rows of x with no match in y = dead).

The Newest Fancy Joins

  • You can use join_by for more joining options, including
    • inequality: only match if greater than a value
    • closest date
    • closest date, as long as it is after the comparison date but within 30 days
    • only match if an interval/date overlaps
    • match a value if it is within a range
  • Learn more at Join Specifications
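  • A minimal sketch with join_by() (requires dplyr >= 1.1.0; labs and visits are hypothetical tables), matching each lab to the closest visit on or before the lab date:
library(dplyr)

labs |> 
  left_join(visits,
            join_by(pat_id, closest(lab_date >= visit_date)))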

Data Horror

Data So Messy, it Constitutes a Data Crime


Messy


What Should You Do?

Two Options

  1. You can find the person who collected/entered the data
  • If you don’t correct this behavior now, this will torture many future data analysts
  • Educate in this teachable moment about tidy data
  • Send them to watch Tidy Spreadsheets on YouTube here to prevent this kind of Data Crime in the future.
  • Improve the world, one (now tidy) data collector at a time.
  2. There is no way to find the person who collected/entered the data
  • Pull out some advanced data cleaning packages made for this particular kind of mess
    • {unpivotr}
    • {tidyxl}
    • {unheadr}
  • Watch the unpivotr/tidyxl 14 min video here
  • Learn from a free e-book, Spreadsheet Munging Strategies
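  • A minimal sketch of the {tidyxl} + {unpivotr} workflow (messy.xlsx and the header layout are hypothetical):
library(tidyxl)
library(unpivotr)
library(dplyr)

cells <- xlsx_cells("messy.xlsx") # one row per cell, with value and formatting

cells |> 
  filter(sheet == "Sheet1", !is_blank) |> 
  behead("up-left", "group") |> # promote a merged, spanning header row to a column
  behead("up", "measure") |>    # promote the next header row
  behead("left", "id")          # promote the header column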

Step by Step Cleaning - Stage 1

  • Import your data cleanly - skip rows, get headers
  • Explore your data
  • Remove empty rows & columns
  • Fix variable classes/types, check dates
  • Clean variable names
  • Create a codebook
  • Validate for missingness, outliers
  • Fix bad category values
  • Identify and consistently label missing values
  • Separate mixed (SBP/DBP) values, assign variable labels

Step by Step Cleaning - Stage 2 Wrangling

  • Let your Question define your Unit of Analysis
  • pivot_longer() or pivot_wider() if needed
  • Thicken time, pad missing times if needed
  • Join to add new data if needed
  • Know what packages to use for data horrors
    • {unpivotr}
    • {tidyxl}
    • {unheadr}

Thank you!

Thank you for joining the workshop!


All materials will live here: https://shannonpileggi.github.io/rmedicine-data-cleaning-2023/