Most datasets produced by others will require some quality control and manipulation before they are in a format that's useful for your analysis. It's easy to get lost in this step -- it can be quite a challenge! Because many researchers only think about data wrangling while they're actually doing the manipulation, it's common to find data that need some attention before they can be reused.
For your work to be reproducible, it's important to think through the data wrangling details up front. There are many ways to approach wrangling other people's data -- a good portion of the data science field was built on exactly this problem. The two main things to think about are good documentation and automation. Good documentation, kept in a DATA_README file, allows a future researcher (including you) to follow your steps and repeat the manipulations you made to the data:
- Did you remove outlying data points?
- What decision rules did you use to identify them?
- Did you combine sub-samples or treatments?
- Did you have to correct errors in the data?
- What assumptions did you make? ...and so on (see the example sketched after this list)
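For instance, a DATA_README excerpt recording these decisions might look something like the sketch below. Every detail here (the column name, threshold, and site code) is a hypothetical stand-in for your own data and rules:

```
DATA_README (excerpt)
- Removed readings with temp_c outside -40 to 50 C (assumed sensor error)
- Combined replicate readings for each site and date by averaging
- Corrected site code "Stn-07 " (trailing space) to "Stn-07"
- Assumed all timestamps are recorded in UTC
```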
Operations like data cleaning and error correction can be done by hand in a spreadsheet program. However, if you're manipulating large amounts of data, or need to apply the same manipulations to multiple datasets, consider automating the work by writing a script. Scripts are great because they both record exactly what was done to the data and can be rerun to re-clean the data if necessary.
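As a minimal sketch of what such a script might look like, the Python example below reads a raw file, applies the kinds of decisions listed above, and writes a separate cleaned file so the raw data stay untouched. The file names, column names, and thresholds are hypothetical and should be replaced with (and documented as) your own:

```python
# clean_data.py -- a minimal sketch of an automated cleaning script.
# File names, column names, and thresholds are hypothetical; adapt them
# to your own data and record the choices in DATA_README.

import pandas as pd

RAW_FILE = "raw_temperatures.csv"      # hypothetical raw input, never edited by hand
CLEAN_FILE = "clean_temperatures.csv"  # cleaned output, kept separate from the raw data

df = pd.read_csv(RAW_FILE)

# Correct a known data-entry error (a site code with a trailing space).
df.loc[df["site"] == "Stn-07 ", "site"] = "Stn-07"

# Decision rule for outliers: drop readings outside a physically
# plausible range, rather than values that merely look extreme.
plausible = df["temp_c"].between(-40, 50)
n_removed = (~plausible).sum()
df = df[plausible]

# Combine sub-samples: average replicate readings from the same site and date.
df = df.groupby(["site", "date"], as_index=False)["temp_c"].mean()

df.to_csv(CLEAN_FILE, index=False)
print(f"Removed {n_removed} out-of-range readings; wrote {CLEAN_FILE}")
```

Because every step lives in the script, rerunning it on a corrected raw file (or on next year's data) repeats the same manipulations automatically, and the script itself doubles as a record of what was done.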