Large data projects in R require consistent work flow principles. The goal is better project management; above all, the work flow should shift time spent from low-value to high-value activity. The approach has three parts: (1) use a basic project template to manage project files and directories; (2) write code with a common set of tools to improve code flow and efficiency; and (3) extend base R with well-accepted R packages that support improved work flow and code syntax.
The result is project code written to be read by people, not just machines, which improves both execution and long-term maintenance.
First, an essential ingredient of data science is data management. Data cleaning and preparation dominate data analysis.1 Data prep is not just a first step: daily work and production processes repeat data prep and transformations many times. As a result, the concept of “tidy data” has been adopted by data science and industry as a consistent approach to data cleaning.2
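As a minimal sketch of the tidy data idea, consider the same measurements in “messy” wide form and in tidy long form, where each row is one observation. The data here are hypothetical quarterly sales figures, reshaped with base R's reshape():

```r
# Hypothetical "messy" wide data: one column per quarter.
messy <- data.frame(
  region = c("east", "west"),
  q1     = c(100, 120),
  q2     = c(110, 125)
)

# Tidy long form: one row per region-quarter observation.
tidy <- reshape(messy,
                direction = "long",
                varying   = c("q1", "q2"),   # wide columns to stack
                v.names   = "sales",         # name for the stacked values
                timevar   = "quarter",
                times     = c("q1", "q2"),
                idvar     = "region")
tidy   # columns: region, quarter, sales
```

In tidy form, every variable (region, quarter, sales) is a column and every observation is a row, which makes later grouping and modeling steps uniform.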
Second, the work flow combines tidy data (data transformations) with nimble data (data juggling). Nimble data is nothing more than a set of simple techniques for managing data analysis tasks efficiently. In base R, nimble analysis avoids for() and while() loops in favor of the apply() family of functions, which split a data object into chunks, apply a function to analyze each chunk, and combine the output into a single result. These “split-apply-combine” techniques are core to the work flow presented.3
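A brief base R sketch of split-apply-combine, using the built-in mtcars data set to compute mean miles-per-gallon by cylinder count:

```r
# Loop version (avoided in this work flow):
means_loop <- numeric(0)
for (cyl in sort(unique(mtcars$cyl))) {
  means_loop[as.character(cyl)] <- mean(mtcars$mpg[mtcars$cyl == cyl])
}

# Split-apply-combine version:
chunks <- split(mtcars$mpg, mtcars$cyl)  # split mpg into chunks by cylinder
means  <- sapply(chunks, mean)           # apply mean() to each chunk; sapply() combines
means                                    # named vector with entries "4", "6", "8"

# tapply() performs the same split-apply-combine in one call:
tapply(mtcars$mpg, mtcars$cyl, mean)
```

The split-apply-combine version states the intent directly and avoids the bookkeeping of the explicit loop.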
Third, data reporting in the work flow is handled with RMarkdown. RMarkdown documents are fully reproducible data analysis tools: they combine narrative text with embedded data and executable source code. Publication-quality documents are easily achieved with RMarkdown and simple templates, and RMarkdown supports dozens of output formats, including Word, PDF and HTML.
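A minimal RMarkdown skeleton shows the combination of narrative and embedded code; the title and chunk contents here are illustrative assumptions, not part of any particular template:

````markdown
---
title: "Quarterly Sales Report"
output: word_document    # also: pdf_document, html_document
---

Narrative text goes here; the analysis below runs when the document is rendered.

```{r mpg-summary}
summary(mtcars$mpg)
```
````

Rendering the file (for example with rmarkdown::render("report.Rmd")) executes every chunk and weaves the results into the output document, so the report can never drift out of sync with the code.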
Finally, the work flow embraces Git and GitHub. Git is local software for version control; GitHub is an online storage archive with project management tools. Together, they are an ideal solution for warehousing project data, source code and documents.
Work Flow Links
The work flow process relies on a consistent grammar of data for reproducible research and repeatable results across machines and over time. The work flow framework has the following components:
- Project Directory Template
- Principles of Tidy Data
- Tidy Data Preparation in R
- Tidy Data Transformations in R
- Split-Apply-Combine Techniques
- Project Reporting Template
- GitHub Data and Code Control
- Dasu, Tamraparni, and Theodore Johnson, Exploratory Data Mining and Data Cleaning, Wiley-IEEE, 2003. ↩
- Wickham, Hadley, “Tidy Data,” Journal of Statistical Software, Volume 59, Issue 10, September 2014. ↩
- Wickham, Hadley, “The Split-Apply-Combine Strategy for Data Analysis,” Journal of Statistical Software, Volume 40, Issue 1, April 2011. ↩