Cleaning Medical Data with R

R/Medicine 2023 Workshop


πŸ—“ June 6, 2023 | 11:00am - 2:00pm EDT

🏨 Virtual

πŸ’₯ FREE workshop registration with registration for the R/Medicine 2023 meeting.

πŸ“Ή Recording available on YouTube (~3 hours)


Overview

In this workshop you will learn how to import messy data from an Excel spreadsheet, and develop the R skills to turn this mess into tidy data ready for analysis.

Learning objectives

  • The acquire a foundational understanding of how data should be organized.

  • To import Excel files with (common, messy) data problems.

  • To address and clean common messy data problems in each variable.

  • To address and clean data with more complex meta-problems, like pivoting to long format for data analysis, thickening timestamped data to manageable time units, padding missing dates, and joining different datasets together.

Is this course for me?

If your answer to any of the following questions is β€œyes”, then this is the right workshop for you.

  • Do you frequently receive untidy data for analysis in Excel spreadsheets?

  • Does this drive you slightly batty?

  • Do you want to learn how to make your life easier with a suite of data cleaning tools?

The workshop is designed for those with some experience in R. It will be assumed that participants can perform basic data manipulation. Experience with the {tidyverse} and the %>%/|> operator is a major plus, but is not required.

Prework

This workshop will largely be conducted in the Posit Cloud environment. Please create a login to the Posit Cloud instance of this workshop here:

bit.ly/posit-cloud-cleaning-medical-data.

If you will not be using the Posit Cloud instance, you can just watch along as the instructors teach.

As a more high-risk alternative, if you would like to try this out on your own computer, please have the following installed and configured on your machine. Note that we will NOT be able to help you with local computer problems during the workshop. If you choose to do this, you are on your own.

  • Recent version of R

  • Recent version of RStudio

  • Recent version of packages used in this workshop.

instll_pkgs <- c("tidyverse", "tidyxl", "unpivotr", "readxl", "openxlsx", "janitor", 
                "gtsummary", "here", "padr",  "here", "skimr", "visdat", "summarytools", 
                "DataExplorer", "Hmisc", "pointblank", "codebookr", "codebook", 
                "memisc", "sjPlot")
install.packages(instll_pkgs)
  • Ensure you can knit R markdown documents

  • Open RStudio and create a new Rmarkdown document

  • Save the document and check you are able to knit it.

  • Ensure you can import data from a Microsoft Excel spreadsheet to R

    • Save a a Microsoft Excel spreadsheet in a known folder on your computer.
    • Open RStudio install/load (library) the packages {readxl} (link) and {openxlsx} (link)
    • Import the data from the spreadsheet into R using the path to the correct folder with either package.

Instructors

Headshot of Crystal Lewis

Crystal Lewis (she/her) is a Freelance Research Data Management Consultant, helping people better understand how to organize, document, and share their data. She has been wrangling data in the field of Education for over 10 years. She is also a co-organizer for R-Ladies St. Louis.



Headshot of Shannon Pileggi

Shannon Pileggi (she/her) is a Lead Data Scientist at The Prostate Cancer Clinical Trials Consortium, a frequent blogger, and a member of the R-Ladies Global team. She enjoys automating data wrangling and data outputs, and making both data insights and learning new material digestible.



Headshot of Peter Higgins

Peter Higgins (he/him) is a Professor at the University of Michigan, a translational researcher interested in reproducible research, and the creator of the {medicaldata} package. He enjoys clean and tidy data when he (rarely) encounters it in the wild.



Further Reading/Viewing:

  • Crystal has an excellent (work in progress) e-book on Data Management in Large-Scale Education Research here.
  • Peter shared his strong feelings about the limitations of Excel as a data entry tool (and how you can use Excel better) in an R/Medicine workshop/YouTube video here.
  • Peter is working on a free e-book (work in progress) on reproducible medical data analysis here and a package of clean (and messy) medical data for practicing and teaching in R here.
  • Our friend Duncan Garmonsway has a free e-book on advanced data cleaning strategies for especially devilish colored, pivoted, or bizarrely structured data here