I recently came across a set of data cleaning tips in Excel from EvaluATE, which provides support for people looking to improve their evaluation practice.


Screenshot of the Excel Data Cleaning Tips

As I looked through the tips, I realized that I could show how to do each of the five tips listed in the document in R. Many people come to R from Excel, so having a set of Excel-to-R equivalents (also see this post on a similar topic) is helpful.

The tips are not intended to be comprehensive, but they do show some common things that people do when cleaning messy data. I did a live stream recently where I took each tip listed in the document and showed its R equivalent.

As I mention at the end of the video, while you can certainly do data cleaning in Excel, switching to R enables you to make your work reproducible. Say you have some surveys that need cleaning today. You write your code and save it. Then, when you get 10 new surveys next week, you can simply rerun your code, saving you countless Excel points and clicks.

You can watch the full video at the very bottom or go to each tip using the videos immediately below. I hope it’s helpful in giving an overview of data cleaning in R!

Tip #1: Identify all cells that contain a specific word or (short) phrase in a column with open-ended text
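
For readers who want something to type along with, here is a minimal base R sketch of this tip. It is not necessarily the approach shown in the video. The `responses` data frame and its column names are made up for illustration, and the search word "great" is just an example.

```r
# Hypothetical open-ended survey responses (made-up data)
responses <- data.frame(
  id = 1:4,
  comment = c(
    "The training was great",
    "Great facilitator",
    "Too long",
    "More examples please"
  ),
  stringsAsFactors = FALSE
)

# grepl() returns TRUE for each cell that contains the pattern;
# ignore.case = TRUE matches "great" and "Great" alike
responses$mentions_great <- grepl("great", responses$comment, ignore.case = TRUE)

# Keep only the rows that mention the word
great_rows <- responses[responses$mentions_great, ]
print(great_rows)
```

If you prefer the tidyverse, `stringr::str_detect()` inside `dplyr::filter()` does a similar job.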

Tip #2: Identify and remove duplicate data
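
A minimal base R sketch of finding and dropping duplicates, assuming a duplicate means a row that matches an earlier row in every column. The `surveys` data frame is made up for illustration, and the video may use different functions.

```r
# Hypothetical survey data where respondent A02 was entered twice
surveys <- data.frame(
  respondent = c("A01", "A02", "A02", "A03"),
  score = c(4, 5, 5, 3),
  stringsAsFactors = FALSE
)

# duplicated() flags every row that repeats an earlier row
is_dup <- duplicated(surveys)

# Inspect the duplicates before removing them
print(surveys[is_dup, ])

# Keep only the first occurrence of each row
surveys_unique <- surveys[!is_dup, ]
print(surveys_unique)
```

To treat rows as duplicates based on a single column only, you could pass just that column, for example `duplicated(surveys$respondent)`.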

Tip #3: Identify the outliers within a data set
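
There are many definitions of an outlier. The sketch below assumes one common rule of thumb: a value is an outlier if it falls more than 1.5 times the interquartile range below the first quartile or above the third quartile. That rule is my assumption here, not necessarily the one used in the tips document or the video, and the `scores` values are made up.

```r
# Hypothetical scores with one suspicious value
scores <- c(12, 14, 15, 13, 16, 14, 95)

# First and third quartiles (names = FALSE returns a plain numeric vector)
q <- quantile(scores, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Flag and extract values outside the fences
is_outlier <- scores < lower | scores > upper
outliers <- scores[is_outlier]
print(outliers)
```

Whether a flagged value is an error or a real but unusual observation is a judgment call, so it is worth looking at each one before deleting anything.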

Tip #4: Separate data from a single column into two or more columns
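
A minimal base R sketch of splitting one column into two, assuming every value contains exactly one space between the two parts. The `people` data frame and the new column names are made up for illustration.

```r
# Hypothetical data with first and last names in one column
people <- data.frame(
  full_name = c("Ada Lovelace", "Grace Hopper", "Alan Turing"),
  stringsAsFactors = FALSE
)

# strsplit() returns a list with one character vector of pieces per row
parts <- strsplit(people$full_name, " ", fixed = TRUE)

# Pull the first and second piece of each row into new columns
people$first_name <- vapply(parts, function(p) p[1], character(1))
people$last_name <- vapply(parts, function(p) p[2], character(1))
print(people)
```

In the tidyverse, `tidyr::separate()` covers the same ground in one call. Names with more than two parts would need extra handling in either approach.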

Tip #5: Categorize data in a column, such as class assignments or subject groups
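
One way to sketch this in base R is with a lookup table, similar in spirit to a lookup formula in Excel. The course names, group labels, and the `subject_lookup` vector below are all made up for illustration, and the video may take a different route.

```r
# Hypothetical data: one course per student
students <- data.frame(
  course = c("Biology", "Chemistry", "English", "History"),
  stringsAsFactors = FALSE
)

# Named vector acting as a lookup table: names are courses, values are groups
subject_lookup <- c(
  Biology = "STEM",
  Chemistry = "STEM",
  English = "Humanities",
  History = "Humanities"
)

# Index the lookup by course name; unname() drops the leftover names
students$subject_group <- unname(subject_lookup[students$course])
print(students)
```

A course that is missing from the lookup comes back as `NA`, which makes uncategorized values easy to spot. For numeric columns, base R's `cut()` can bin values into labeled ranges instead.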

Full Video

*This is a repost of David Keyes’ blog post, Data Cleaning Tips in R.

About the Authors

David Keyes

Founder, R for the Rest of Us

David Keyes has over a decade of experience conducting research and evaluation. He has led the Mexican Migration Field Research and Training Program at the University of California, San Diego; conducted evaluation work as part of the Oregon Community Foundation research team; and served as a data visualization consultant to other researchers and evaluators. In recent years, David has also trained evaluators (and others) to use R—the most powerful tool for data analysis and visualization—as the founder of R for the Rest of Us.

Creative Commons

Except where noted, all content on this website is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

EvaluATE is supported by the National Science Foundation under grant numbers 0802245, 1204683, 1600992, and 1841783. Any opinions, findings, and conclusions or recommendations expressed on this site are those of the authors and do not necessarily reflect the views of the National Science Foundation.