Project Control with Git and GitHub

Introduction

GitHub is a web-based hosting service for file archiving and version control. GitHub builds on Git, a version control tool installed on the local machine. Data science projects use Git and GitHub to provide access to, and control of, project data, source code and narrative text files. In practice, RStudio provides a built-in interface to Git.

Benefits

Version control is an essential feature of any project, and the benefits are simple:

  • Collaboration: Provide a central repository of files for collaboration;
  • Accessibility: Any machine can access files from any location;
  • Synchronization: Local data and file versions synchronize with the latest updates;
  • Storage efficiency: Only changes are stored;
  • Versioning: Manage progress with standardized milestone or version tracking;
  • Error correction: Earlier file versions and new fixes replace mistakes;
  • Experimentation: Explore scenarios without impact to production data or code;
  • Task management: Keep track of pending tasks, new requests and source code bugs;
  • Documentation: Project wikis orient users on active or dormant project details;
  • Professionalism: Adopt workflow tools used in many development environments.

Getting Started

Open a GitHub Account

Go to https://github.com. GitHub offers, at no cost, an unlimited number of public repositories; these have limited features and allow up to 3 collaborators. Public repos are searchable, and anybody can view and download project contents. GitHub members can also set up 3 private repos at no cost. A personal plan with unlimited private repos costs $7 / month. Current plans, plan options and pricing are detailed here.

Install Git

Install Git prior to using it with RStudio, using the appropriate method by platform:
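The platform-specific list is not reproduced in this copy. As a rough sketch, and assuming common package managers that this article does not name (apt on Debian/Ubuntu, Homebrew on macOS), the following shell snippet prints a typical install command for the current platform:

```shell
# Print a typical Git install command for the current platform.
# Assumption: apt (Debian/Ubuntu) and Homebrew (macOS) are the package
# managers in use; other systems may need a different command.
suggest_git_install() {
  case "$(uname -s)" in
    Linux*)  echo "sudo apt-get install git" ;;
    Darwin*) echo "brew install git" ;;
    *)       echo "Download the installer from https://git-scm.com/downloads" ;;
  esac
}

suggest_git_install
```

The function only prints a suggestion; run the printed command yourself once you have confirmed it suits your system.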

An excellent resource for learning more about Git and how to use it is the Pro Git online book. Another good resource is the Git Bootcamp provided by GitHub.

Set Up Git in RStudio and Link to GitHub

To activate Git, first confirm the install was successful.  Locate the Git executable file in the shell terminal. In RStudio, click Tools > Shell and enter:
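The exact command from the original page is not shown in this copy. A common way to confirm the install, offered here as an assumption rather than the author's exact command, is:

```shell
# Show where the Git executable lives (this is the path RStudio needs).
command -v git

# Print the installed Git version to confirm that it runs.
git --version
```

If the first command prints nothing, Git is probably not installed or not on the shell's search path.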

Next, go to RStudio menu Tools > Global Options > Git/SVN.  Ensure the path to the Git executable is correct and update as needed.

Next, click Create RSA Key to generate an SSH key pair, which secures communication between your computer and GitHub.

Close this window. Click on View Public Key and copy the window contents. Now, open GitHub, log in, go to Settings and then select SSH Keys. Click Add SSH Key. Paste into GitHub the public key you copied from RStudio.
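RStudio's key button has a command-line counterpart. The sketch below assumes OpenSSH's ssh-keygen is installed. The throwaway directory and the empty passphrase are illustrative choices only; a real key would normally live in ~/.ssh and carry a passphrase.

```shell
# Generate an RSA key pair in a throwaway directory (hypothetical location).
key_dir="$(mktemp -d)"
ssh-keygen -t rsa -b 4096 -N "" -f "$key_dir/id_rsa" -q

# Print the public key; this is the text to paste into GitHub.
cat "$key_dir/id_rsa.pub"
```

Only the .pub file is shared. The private key (id_rsa) stays on the local machine.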

Next, provide Git on your local machine with some simple and standard identity details.  In RStudio, click Tools > Shell and enter:

git config --global user.email "your.email@service.com"
git config --global user.name "your.GitHub.UserName"

The link is now in place between RStudio, Git and GitHub. As a result, the project is ready for implementation.

Create a Project with Git – Project Author

Create RStudio Project

In RStudio, go to File > New Project as normal. Click New Directory.

Name the project and check Create a git repository.
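Checking Create a git repository has roughly the same effect as running git init in the project folder. A small sketch, using a temporary directory as a stand-in for the project (the name my-project is hypothetical):

```shell
# Create a stand-in project folder and initialize an empty Git repository.
project_dir="$(mktemp -d)/my-project"
mkdir -p "$project_dir"
cd "$project_dir"
git init

# The hidden .git directory holds the local repository.
ls -d .git
```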

Install Project Template Directories

Next, download the standard project template of directories and default files, and extract or copy them into your project folder.

For example, the merged files and directories should look close to this:

Create Source Code Script and Identify all new Files

Next, in RStudio, create a new script (test.R) and save it in the src directory. After saving, the script should appear in RStudio’s Git tab (located in one of the RStudio panes). See the Git tab below:

Git has identified all the files and directories (e.g. config, docs, src) that contain files to be committed to Git’s local repository on the local workstation. In the Staged column, check the box for each file and directory to be committed locally. The status should turn to a green ‘A’.
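Ticking the Staged box corresponds to git add on the command line. A sketch in a throwaway repository follows; the file name src/test.R comes from this article, while the temporary directory and the script contents are assumptions for the demo.

```shell
# Set up a throwaway repository with one new script.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
mkdir src
echo 'print("hello")' > src/test.R

# Stage the new file, then list the status in short form.
# A leading "A" marks a newly added (staged) file.
git add src/test.R
git status --short
```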

Commit New Files

First, click Commit. Second, enter an identifying update message in the Commit message field. In this example, the message is “Initiating repo with initial project files.” RStudio then displays the commit confirmation. For instance:
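The same commit can be made from the shell. This is a minimal sketch in a throwaway repository; the README.md contents are placeholders, and the identity values repeat the placeholders used earlier in this article.

```shell
# Set up a throwaway repository with one staged file.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.email "your.email@service.com"
git config user.name "your.GitHub.UserName"
echo "# Demo" > README.md
git add README.md

# Commit the staged files with the message used in this walkthrough.
git commit -m "Initiating repo with initial project files."

# Show the most recent commit on one line.
git log --oneline -1
```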

Push Files to GitHub

The final series of steps is simple. The goal is to push the file contents from the local repo to GitHub. In GitHub, create a new private repository called Nearby-Prices. For example:

Do not select the option to create a README.md file.  The project directory already has a README file on the local machine.

The synchronization of the local Git repo with the GitHub repo is managed using shell commands. First, activate the shell (RStudio Tools > Shell) from the project working directory. Second, enter the following commands:
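The commands themselves are missing from this copy of the article. Judging from the description that follows (a repository URL and a push with the -u option), they were most likely git remote add and git push -u; that is an inference, not a quotation. The sketch below runs those two commands against a local bare repository standing in for GitHub, so it can be tried without an account. With a real repo, the remote would be a URL of the form https://github.com/<user.name>/<repo.name>.git.

```shell
# A local bare repository stands in for the GitHub repo (demo assumption).
remote_dir="$(mktemp -d)/nearby-prices.git"
git init -q --bare "$remote_dir"

# A throwaway local project with one commit.
local_dir="$(mktemp -d)"
cd "$local_dir"
git init -q
# Name the branch master, as in this article (newer Git may default to main).
git symbolic-ref HEAD refs/heads/master
git config user.email "your.email@service.com"
git config user.name "your.GitHub.UserName"
echo "# Nearby-Prices" > README.md
git add README.md
git commit -q -m "Initiating repo with initial project files."

# Link the local repo to the remote, then push and set the upstream branch.
git remote add origin "$remote_dir"
git push -u origin master

# Confirm that -u recorded the upstream remote for master.
git config --get branch.master.remote
```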

The general format of the URL above is https://github.com/<user.name>/<repo.name>. Note the -u option on the push command. The option adds the “upstream” master branch for the GitHub repo to the git config file (see git config --list). The result links the local master branch to the master branch on GitHub. The new config setting also activates RStudio’s green push and blue pull buttons for easier syncing.

Go to the browser with the GitHub repo. Hit Refresh to confirm that the local repo files were pushed to GitHub. All the project files should be visible and committed on GitHub.

Link to a Project on GitHub – Project Collaborator

To link to an existing project on GitHub in RStudio, go to File > New Project > Version Control > Git. For example:

In the field labelled “Repository URL”, paste the URL of your new GitHub repository. Using the repo we created above, it would be: https://github.com/bxhorn/nearby-prices.git. The remaining fields require a project name and the directory in which to locate the project. Click “Create Project” to:

  • Create the project directory on your computer
  • Link the local Git repository to the remote GitHub repository
  • Launch the RStudio Project
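Behind the dialog, the first two steps in the list above amount to a git clone. The sketch below clones from a local repository that stands in for GitHub (a demo assumption). With the real project, the first argument to git clone would be the repository URL given above.

```shell
# A local repository with one commit stands in for the GitHub repo.
source_dir="$(mktemp -d)"
cd "$source_dir"
git init -q
git config user.email "your.email@service.com"
git config user.name "your.GitHub.UserName"
echo "# Nearby-Prices" > README.md
git add README.md
git commit -q -m "Initiating repo with initial project files."

# Clone it into a new project directory.
clone_dir="$(mktemp -d)/nearby-prices"
git clone -q "$source_dir" "$clone_dir"

# The clone already has the source registered as its "origin" remote.
cd "$clone_dir"
git remote -v
```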

Easy! The collaborator role is quick to establish. In fact, authors can establish things just as quickly if they begin by creating the GitHub repo first (not after the R project is created). The big advantage of the “GitHub first, then RStudio” workflow is simple: the GitHub repo is automatically added as a remote for your local repo, and your local master branch automatically tracks the master branch on GitHub.

Work and Ongoing Synchronization

Project files are easy to control with the local and GitHub repos linked. Above all, do work and commit changes frequently to the local repo. Push the changes to the GitHub repo less frequently.

If a collaborator has updated work recently, start a work session by pulling the latest file updates into the local repo. Then, make changes as needed, and commit them to the local repo frequently. As needed, push updated files back to GitHub.

In summary, do work, commit frequently, and pull or push to the cloud as needed. And then, repeat, repeat, repeat.
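In shell terms, the cycle is git pull, git commit and git push. The sketch below walks through one round against a local bare repository standing in for GitHub; the setup lines, file contents and commit messages are assumptions for the demo.

```shell
# Demo setup: a local bare repository stands in for GitHub.
remote_dir="$(mktemp -d)/nearby-prices.git"
git init -q --bare "$remote_dir"
work_dir="$(mktemp -d)/nearby-prices"
git clone -q "$remote_dir" "$work_dir"
cd "$work_dir"
git config user.email "your.email@service.com"
git config user.name "your.GitHub.UserName"
echo "# Nearby-Prices" > README.md
git add README.md
git commit -q -m "Initiating repo with initial project files."
git push -q -u origin HEAD

# The work cycle: pull, edit, commit (often), push (as needed).
git pull -q
echo "Notes on price data" >> README.md
git commit -q -a -m "Add notes to README"
git push -q

# Show the branch and its upstream; no "ahead" marker means all is pushed.
git status --short --branch
```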

A more detailed Git discussion (including command references, workflow forking, branch management and troubleshooting) can be found here.
