Regular Expressions (RegEx) in R

In computing, a regular expression (abbreviated regexp or regex) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings. A pattern is typically a combination of literal text, metacharacters, and wildcards. Regular expressions are used to search for objects, extract substrings, and perform find-and-replace operations. They are convenient and can greatly simplify data or object management.

regexp Functions in R

Functions in R for regular expressions include:

[table id=34 /]

For example:
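The sketch below applies a few base R functions to a made-up character vector. It assumes the table above covers the usual base R functions (grepl, grep, sub, gsub, regexpr, regmatches); the vector x and its contents are illustrative only.

```r
# Made-up example data (illustrative only).
x <- c("apple pie", "banana split", "cherry pie")

has_pie   <- grepl("pie", x)                # TRUE FALSE TRUE
pie_index <- grep("pie", x)                 # 1 3
pie_value <- grep("pie", x, value = TRUE)   # "apple pie" "cherry pie"

# sub() replaces only the first match in each element, gsub() replaces all.
first_a <- sub("a", "A", x)    # "Apple pie" "bAnana split" "cherry pie"
every_a <- gsub("a", "A", x)   # "Apple pie" "bAnAnA split" "cherry pie"

# regexpr() locates the first match; regmatches() extracts the matched text.
first_word <- regmatches(x, regexpr("[a-z]+", x))   # "apple" "banana" "cherry"
```

grepl is usually the most convenient choice for filtering, because its logical result can be used directly as a subscript, as in x[has_pie].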

Basic Pattern Concepts

A pattern is an expression used to specify a set of strings required for a particular purpose. A simple way to specify a set of strings is complete enumeration, that is, listing all of its members. However, there are often more concise ways to specify the desired set. For example, the set containing the three strings “Handel”, “Händel”, and “Haendel” can be specified by the pattern H(ä|ae?)ndel, which matches each of the three strings. If at least one regexp matches a particular set, then there is at least one other pattern, and possibly an infinite number of patterns, that generates the same result.
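As a minimal sketch, the pattern can be checked in R with grepl. The anchors ^ and $ are an addition here so that the whole string must match, and the candidate vector is made up for illustration.

```r
# "\u00e4" is the Unicode escape for the letter a-umlaut, used for portability.
composers <- c("Handel", "H\u00e4ndel", "Haendel", "Hendel", "Haandel")

# ^ and $ anchor the pattern to the start and end of each string.
pattern <- "^H(\u00e4|ae?)ndel$"

matches <- grepl(pattern, composers)   # TRUE TRUE TRUE FALSE FALSE
composers[matches]                     # the three accepted spellings
```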

Pattern matching frequently makes use of the following operations to construct regular expressions.

Boolean “or”
A vertical bar separates alternatives. For example, gray|grey can match “gray” or “grey”.

Grouping
Parentheses define the scope and precedence of the operators. For example, gray|grey and gr(a|e)y are equivalent patterns that both describe the set containing “gray” and “grey”.
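The following sketch compares the two patterns on a made-up vector and shows why grouping matters once anchors are involved. The anchors ^ and $ are used here only to illustrate precedence.

```r
words <- c("gray", "grey", "groy", "greyhound")

alt     <- grepl("gray|grey", words)   # TRUE TRUE FALSE TRUE
grouped <- grepl("gr(a|e)y", words)    # TRUE TRUE FALSE TRUE
same    <- identical(alt, grouped)     # TRUE

# Alternation has the lowest precedence: "^gray|grey$" means
# "starts with gray" OR "ends with grey", not "exactly gray or grey".
loose  <- grepl("^gray|grey$", "graybeard")   # TRUE
strict <- grepl("^gr(a|e)y$", "graybeard")    # FALSE
```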

Quantification
A quantifier after a token (such as a character) or a group specifies how often that element is allowed to occur. The most common quantifiers are the question mark ?, the asterisk *, and the plus sign +. For example:

      • ? The question mark indicates zero or one occurrence of the preceding element. Hence, colou?r is a pattern that matches both color and colour;
      • * The asterisk indicates zero or more occurrences of the preceding element. (It is a quantifier, not a stand-alone wildcard as in file-name globbing.) Hence, ab*c matches ac, abc, abbc, abbbc, and so on; and
      • + The plus sign indicates one or more occurrences of the preceding element. Thus, ab+c matches abc, abbc, abbbc, and so on, but not ac.
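The three quantifiers can be tried out with grepl, as in this minimal sketch. The test strings are made up, and the anchors ^ and $ are added so that each string must match in full.

```r
# ? : zero or one "u"
q_mark <- grepl("^colou?r$", c("color", "colour", "colouur"))   # TRUE TRUE FALSE

# * : zero or more "b"
star <- grepl("^ab*c$", c("ac", "abc", "abbc", "adc"))          # TRUE TRUE TRUE FALSE

# + : one or more "b"
plus <- grepl("^ab+c$", c("ac", "abc", "abbbc"))                # FALSE TRUE TRUE
```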

These constructions can be combined to form arbitrarily complex expressions, much as one constructs arithmetic expressions from numbers and operators. For example, H(ae?|ä)ndel and H(a|ae|ä)ndel are both valid patterns that match the same strings as the earlier example, H(ä|ae?)ndel.
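One way to check that the three patterns agree is to run each of them over the same candidate strings. This is a sketch on a small made-up sample, not a proof of equivalence, and the anchors ^ and $ are added so that each string must match in full.

```r
# "\u00e4" is the Unicode escape for the letter a-umlaut.
patterns <- c("^H(\u00e4|ae?)ndel$",
              "^H(ae?|\u00e4)ndel$",
              "^H(a|ae|\u00e4)ndel$")
candidates <- c("Handel", "H\u00e4ndel", "Haendel", "Hendel", "Hndel")

# One column per pattern, one row per candidate string.
results <- sapply(patterns, grepl, x = candidates)

# All three columns agree on this sample.
agree <- all(results[, 1] == results[, 2]) && all(results[, 2] == results[, 3])
```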

Common Metacharacters

Basic metacharacters are listed below:

[table id=35 /]
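The table contents are not reproduced here, so the sketch below assumes it includes the dot, the end anchor $, and the backslash escape. In R a backslash must be doubled inside a string literal, so a literal dot is written "\\." in the pattern. The file names are made up.

```r
files <- c("report.csv", "report_csv", "data.csv.bak")

# An unescaped "." matches any single character, so "report_csv" also matches.
any_char <- grepl(".csv", files)                  # TRUE TRUE TRUE

# "\\." matches a literal dot; "$" anchors the match at the end of the string.
literal <- grepl("\\.csv$", files)                # TRUE FALSE FALSE

# fixed = TRUE treats the whole pattern as plain text, with no metacharacters.
as_text <- grepl(".csv", files, fixed = TRUE)     # TRUE FALSE TRUE
```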

POSIX Character Classes

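Character classes are the most basic regular expression concept after complete enumeration. POSIX class names such as [:digit:] can be used in place of their ASCII equivalents such as 0-9; note that a POSIX class is written inside a bracket expression, as in [[:digit:]]. The choice between POSIX and ASCII notation is simple: always use the easiest and most intuitive option: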

[table id=36 /]

For example, [[:upper:]ab] matches any single uppercase letter or a lowercase “a” or “b”.
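A short sketch of this class in use, together with a POSIX class and its ASCII range side by side. The input strings are made up, and ^ is added to test only the first character.

```r
x <- c("Alpha", "beta", "cat", "dog", "ZED")

# Does the string start with an uppercase letter, "a", or "b"?
starts <- grepl("^[[:upper:]ab]", x)              # TRUE TRUE FALSE FALSE TRUE

# [[:digit:]] and [0-9] give the same result on ASCII input.
posix <- gsub("[[:digit:]]", "", "a1b22c")        # "abc"
ascii <- gsub("[0-9]", "", "a1b22c")              # "abc"
```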

Character Functions

The following table lists common character functions that can be used with regular expressions:

[table id=12 /]
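The table contents are not reproduced here, so the sketch below assumes it covers common base R character functions such as strsplit, nchar, toupper, substr, and paste. The sample string is made up.

```r
s <- "Regular  expressions in   R"

# strsplit() accepts a regular expression; split on runs of whitespace.
parts <- strsplit(s, "[[:space:]]+")[[1]]   # "Regular" "expressions" "in" "R"

lengths_ <- nchar(parts)                    # 7 11 2 1
shout    <- toupper(parts[1])               # "REGULAR"
prefix   <- substr(parts[2], 1, 4)          # "expr"

# Reassemble the words with single spaces.
clean <- paste(parts, collapse = " ")       # "Regular expressions in R"
```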
