Version control

From sasCommunity

Version control (also known as revision control) is a method of saving versions of files, rather than simply modifying one copy (thereby overwriting previous versions). This is very useful in software development. Because the system records changes rather than separate copies of the code, it is easy to see exactly what differs between versions and to go back to a previous version if necessary.

The Basics

How do you version control?

A rudimentary strategy is to rename successive versions, for example by appending a date or version number to the file name, or by storing each version in a folder with a different name. However, several software solutions are easy to use and widely used in industry and research. The benefit of using these tools is being able to track and view differences between successive versions of your files. If you want to go back in time, you can quickly revert. If you make a mistake, you can easily undo your changes, even across many files.
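As a rough illustration of the rudimentary strategy, the following Python sketch copies a file and appends a date to the copy's name. The function name save_dated_copy and the file name report.sas are hypothetical, chosen only for this example.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def save_dated_copy(source: Path, today: date) -> Path:
    """Copy `source` next to itself with the date appended to the file name."""
    stamped = source.with_name(f"{source.stem}_{today.isoformat()}{source.suffix}")
    shutil.copy2(source, stamped)
    return stamped


# Work in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    program = Path(tmp) / "report.sas"  # hypothetical file name
    program.write_text("proc print data=sashelp.class; run;\n")
    copy_path = save_dated_copy(program, date(2024, 1, 15))
    copy_name = copy_path.name  # "report_2024-01-15.sas"
    copy_matches = copy_path.read_text() == program.read_text()
```

Note what this approach does not give you: nothing records what changed between two copies or why, which is the gap that version control software fills.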

All programmers should know at least one version control system, and know it well. These systems can be used with any operating system, either through the command line or a graphical (GUI) program. They can be run locally, or used with remote repositories on shared network drives or servers.

See below for a deeper explanation of the terminology and software options.

What can be version controlled?

Version control was originally developed for text files - namely, software code. So any SAS code or configuration files can be version controlled. CSV files are also text files, and can be version controlled. Other data can (and probably should) be version controlled, but you should first research the capabilities of your version control software and consider whether alternative data management/warehousing tools would be a better fit. In SAS, generation data sets are one solution for versioning data sets.

Terminology

  • branch - a line of development. The master or trunk is the central branch, but branches may be made off of it. The process of integrating a branch back to the master is called merging.
  • clone - to copy the contents of a repository to a new directory.
  • commit - to save a changeset to the repository. This may include adding, removing, or editing a file or multiple files.
  • commit message - each commit has a message about what was changed. There are different opinions on what the message should include, but in general it should describe what was changed in that commit.
  • diff - difference between any two given commits. You can also see the diff between your working copy and the branch you are on.
  • log - this is where the list of commits is visible. You can see commit messages, see the diff between different versions, and revert to a previous commit.
  • master - the official production copy of the project, sometimes called the trunk.
  • merge - the process of integrating two branches. For example, merging a personal branch back in with the master/trunk. If the same file has been edited independently in both branches, a merge conflict may occur.
  • move (or mv) - the version control method of renaming a file or moving it to another folder. This preserves the revision history. If this is not done, the software will think you deleted the file and added a completely new one.
  • pull - to get the latest copy of the entire project (or individual files) from the master.
  • push - to send a commit or series of commits back to the master.
  • repository - where the changesets are stored. Users pull, commit, and push changes from/to the repository.
  • revert - the method of "un-doing" a commit or several commits. This is one of the benefits of using version control.
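To make some of these terms concrete, here is a toy Python model of a repository that tracks a single text file. It is only a sketch of the ideas of commit, log, diff, and revert; the class ToyRepository is hypothetical, and real tools such as Git or SVN store and compare data very differently.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class Commit:
    message: str
    content: str


@dataclass
class ToyRepository:
    """Keeps every committed version of one text file, oldest first."""
    commits: list = field(default_factory=list)

    def commit(self, content: str, message: str) -> int:
        """Save a new version and return its position in the history."""
        self.commits.append(Commit(message, content))
        return len(self.commits) - 1

    def log(self) -> list:
        """Return the commit messages, newest first."""
        return [c.message for c in reversed(self.commits)]

    def diff(self, old: int, new: int) -> list:
        """Return the added (+) and removed (-) lines between two commits."""
        before = self.commits[old].content.splitlines()
        after = self.commits[new].content.splitlines()
        changes = list(difflib.unified_diff(before, after, lineterm=""))
        # The first two lines are file headers; "@@" lines and context lines are skipped.
        return [line for line in changes[2:] if line.startswith(("+", "-"))]

    def revert(self, index: int) -> int:
        """Undo later changes by committing the content of an earlier commit again."""
        old = self.commits[index]
        return self.commit(old.content, f"Revert to: {old.message}")


repo = ToyRepository()
first = repo.commit("data a;\n  x = 1;\nrun;", "Create data step")
second = repo.commit("data a;\n  x = 2;\nrun;", "Change x to 2")
changed = repo.diff(first, second)
repo.revert(first)
history = repo.log()
restored = repo.commits[-1].content
```

In this sketch a revert is itself recorded as a new commit, so the history of what happened is never lost.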

More complex usage can include tags, which are essentially named markers attached to a particular commit. For example, a series of commits may eventually result in a copy that the team is happy with, and that copy is given a name like "Version 2.0" or "Release 1.4.7". In this way, it's easy to go back to what the code looked like at that time.

Via the command line

When using the command line, you will typically run git <command> or svn <command>. You can use the help command (git help or svn help) to see a list of commands and get information about them. The commands add, rm (remove), and status are frequently used as you incorporate version control into your workflow.

Version control software

Git

Git is one of the most common version control systems. Originally developed to maintain the Linux kernel source, Git is widely used today, for both open- and closed-source projects.

What makes Git unique?

Git is a distributed version control system, as opposed to a centralized system like SVN (see below). Rather than making every contribution to the same, central repository, users clone the "master" copy of the project, and make commits to their local branches. When they are ready for their changes to be made available to everybody, they can push to the master.
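The clone, commit locally, then push workflow can be sketched with a toy Python model in which each repository is just a list of commit messages. The helper functions clone and push are hypothetical stand-ins, not Git's API, and the rule that a push is refused when the central copy has moved on is a simplification of how Git behaves by default.

```python
def clone(repository: list) -> list:
    """Give the user a full, independent copy of the history."""
    return list(repository)


def push(local: list, remote: list) -> None:
    """Send commits the remote lacks; refuse if the remote has newer commits."""
    if local[:len(remote)] != remote:
        raise RuntimeError("remote has new commits; pull before pushing")
    remote.extend(local[len(remote):])


central = ["Initial import"]

# Two users clone the central repository and commit locally.
alice = clone(central)
bob = clone(central)
alice.append("Add macro")
alice.append("Fix typo")
bob.append("Update report")

# Local commits do not touch the central repository until they are pushed.
central_untouched = central == ["Initial import"]

push(alice, central)

# Bob cloned before Alice pushed, so his history no longer lines up.
try:
    push(bob, central)
    bob_rejected = False
except RuntimeError:
    bob_rejected = True
```

Bob's rejected push is the situation that pulling the latest changes first is meant to avoid.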

Using Git

On Windows, Git can be used through the free TortoiseGit software, a graphical client that works on top of a Git installation. GitHub is a company that hosts projects using Git; it offers free hosting for open-source projects and paid plans for closed-source projects. You can install their software by selecting your operating system here.

Resources

Confused? Not convinced? Check out Git and GitHub in Plain English.

Pro Git is a free book that is available online, and does an excellent job of explaining concepts and techniques of Git.

To practice your skills, you can check out this git game on GitHub.

Subversion

Subversion (SVN) is also one of the most common version control systems. It can be installed via the free TortoiseSVN software.

Using SVN

Unlike Git, when you use Subversion you commit directly to the central repository. Branching is still possible, but a branch is also stored centrally, and users commit to that copy. Branches can be merged back to the trunk (another word for the central copy), which typically ends the life of the branch.

Resources

While Git and GitHub in Plain English does focus on git, many of the concepts are applicable to version control in general.

Version Control with Subversion is a free eBook about Subversion from O'Reilly Media.

Others

Mercurial is similar to Git, and is easy to learn. You can see more on Mercurial's website. It is used by Facebook, which also contributes to the code.[1]

Some companies like Google use their own in-house solutions.[2]

Advanced techniques

See Version control/advanced for advanced concepts.

SAS version control integration

SAS software has some support for version control. See this blog post for use in Enterprise Guide (or this thread for process flow if your code runs on a server). For use within DI Studio, see this SAS Support page.

Final thoughts

Best practices

While you may find conflicting recommendations or guidelines from different sources, it's generally best practice to commit early, and commit often. Commits can be undone; changes or files that were deleted before being committed generally cannot be recovered (with some exceptions).

Make sure you have the latest copy of the code before you commit, especially if multiple people are working in the same branch or repository. In some cases, this may mean pulling the latest code at the beginning of the day. In general, it is good practice to pull before you add and commit, because that is much easier than handling a potentially complicated merge later.

Write descriptive commit messages. They should be succinct, yet detailed enough that others (or your future self) can easily understand what was changed in that commit. Multiple lines may be used, in which case the first line should be a short summary and the following lines more detailed.
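As a small illustration of that structure, the Python sketch below splits a multi-line commit message into its summary line and its detailed body. The function names are hypothetical, and the 50-character limit is an assumed team convention rather than a rule from this article or a requirement of any tool.

```python
def split_commit_message(message: str) -> tuple:
    """Split a commit message into its summary line and detailed body."""
    summary, _, body = message.strip().partition("\n")
    return summary.strip(), body.strip()


def summary_is_short(message: str, limit: int = 50) -> bool:
    """Check the first line against `limit`, an assumed team convention."""
    summary, _ = split_commit_message(message)
    return 0 < len(summary) <= limit


message = """Fix date filter in monthly report

The WHERE clause compared a character date to a numeric one,
so no rows were selected. Convert with INPUT() before comparing.
"""

summary, body = split_commit_message(message)
```

Here summary holds the one-line description that appears in the log, while body keeps the longer explanation for anyone who needs the details.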

Which version control system/software should I use?

There are many methods and tools for version control. One isn't necessarily superior to the others, so research and consideration of your needs and goals will help determine the best for your project, team, or organization. See the discussion page for more on that.

Further reading