Regular Expressions (RegEx) in R

In computing, a regular expression (abbreviated regexp) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings. The patterns are often a combination of literal text, metacharacters, and wild cards. Regular expressions are used to search for objects, extract parts of strings, and perform find/replace operations. They are convenient, and a single short pattern can replace many lines of manual string handling when managing data or objects.

regexp Functions in R

Functions in R for regular expressions include:

grep(regexp, vector)               Finds the elements of the vector that contain a match for regexp; returns their indices by default.
sub(regexp, replacement, vector)   Replaces the first substring matching the regular expression with the replacement (for each element of the vector).
gsub()                             Does the same thing as sub(), but replaces every match in each string.
regexpr(regexp, vector)            Returns the position of the first match within each string.
gregexpr()                         Is the same as regexpr(), except that it returns the positions of all matches.
strsplit()                         Splits a string at each match of a regular expression.
glob2rx()                          Converts filename wildcard specifications to regular expressions.

For example:
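The sketch below runs each of the functions listed above on a small, made-up character vector (the vector `x` and its contents are assumptions chosen purely for illustration). It also uses grepl(), a close relative of grep() that returns TRUE/FALSE for each element rather than indices.

```r
# Hypothetical sample data, used only for illustration.
x <- c("apple pie", "banana", "cherry tart", "apple tart")

idx     <- grep("apple", x)                # indices of matches: 1 4
hits    <- grep("apple", x, value = TRUE)  # the matching strings themselves
is_tart <- grepl("tart$", x)               # FALSE FALSE TRUE TRUE

first_a <- sub("a", "A", x)    # only the first "a" in each element is replaced
all_a   <- gsub("a", "A", x)   # every "a" in each element is replaced

pos     <- regexpr("a", x)           # position of the first "a": 1 2 9 1
all_pos <- gregexpr("a", x)[[2]]     # every "a" in "banana": 2 4 6

parts   <- strsplit(x, " ")          # list of pieces, split at each space
rx      <- glob2rx("*.csv")          # wildcard converted to "^.*\\.csv$"
```

Note that regexpr() and gregexpr() return the match lengths as an attribute ("match.length") alongside the positions.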

Basic Pattern Concepts

A pattern is an expression used to specify a set of strings required for a particular purpose. A simple way to specify a set of strings is complete enumeration, that is, listing every member. However, there are more concise ways to specify the desired set of strings. For example, the set containing the three strings “Handel”, “Händel”, and “Haendel” can be specified by the pattern H(ä|ae?)ndel. This pattern matches each of the three strings. If at least one regexp matches a particular set, then there is at least one other pattern, and possibly an infinite number of patterns, that generates the same result.

Pattern matching frequently makes use of the following operations to construct regular expressions.

Boolean “or”
A vertical bar separates alternatives. For example, gray|grey can match “gray” or “grey”.

Grouping
Parentheses define the scope and precedence of the operators. For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of “gray” or “grey”.

Quantification
A quantifier after a token (such as a character) or group specifies how often that element is allowed to occur. The most common quantifiers are the question mark ?, the asterisk *, and the plus sign +. For example:

      • ? The question mark indicates zero or one occurrence of the preceding element.  Hence, colou?r is a pattern that matches both color and colour;
      • * The asterisk indicates zero or more occurrences of the preceding element. Unlike the filename wild card of the same name, it does not match arbitrary text by itself.  Hence, ab*c matches ac, abc, abbc, abbbc, and so on; and
      • + The plus sign indicates one or more occurrences of the preceding element.  Thus, ab+c matches abc, abbc, abbbc, and so on, but not ac.
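The operations above can be tried directly in R. This is a minimal sketch using grepl(), which returns TRUE or FALSE for each element; the test vectors are made up for illustration. The patterns are anchored with ^ and $ so that the whole string must match, not just a substring.

```r
# Alternation and grouping: both patterns accept "gray" and "grey" only.
words <- c("gray", "grey", "griy")
alt_grouped <- grepl("^gr(a|e)y$", words)     # TRUE TRUE FALSE
alt_listed  <- grepl("^(gray|grey)$", words)  # TRUE TRUE FALSE

# ? : zero or one "u"
spellings <- c("color", "colour", "colouur")
opt_u <- grepl("^colou?r$", spellings)        # TRUE TRUE FALSE

# * : zero or more "b";  + : one or more "b"
s <- c("ac", "abc", "abbc")
star <- grepl("^ab*c$", s)                    # TRUE TRUE TRUE
plus <- grepl("^ab+c$", s)                    # FALSE TRUE TRUE
```

The parentheses in ^(gray|grey)$ matter: without them, ^gray|grey$ would mean "starts with gray" or "ends with grey", because alternation has the lowest precedence.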

These constructions can be combined to form arbitrarily complex expressions, much like one can construct arithmetical expressions from numbers. For example, H(ae?|ä)ndel and H(a|ae|ä)ndel are both valid patterns which match the same strings as the earlier example, H(ä|ae?)ndel.
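One way to check that the three Handel patterns really are interchangeable is to run them over the same candidates and compare the results. The sketch below does this; the candidate vector, including the non-matching "Hendel", is an assumption added for illustration, and "ä" is written as the escape \u00e4 so the code does not depend on the file encoding.

```r
# The three equivalent patterns, anchored so the whole string must match.
patterns <- c("^H(\u00e4|ae?)ndel$",
              "^H(ae?|\u00e4)ndel$",
              "^H(a|ae|\u00e4)ndel$")

# Three valid spellings plus one that should be rejected.
candidates <- c("Handel", "H\u00e4ndel", "Haendel", "Hendel")

# One column per pattern, one row per candidate string.
results <- sapply(patterns, function(p) grepl(p, candidates))

# Each pattern accepts the first three candidates and rejects the fourth.
all(results[1:3, ])   # TRUE
any(results[4, ])     # FALSE
```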

Common Metacharacters

Basic metacharacters are listed below:

$        Matches the ending position of the string.
( )      Defines a marked subexpression.
*        Matches the preceding element zero or more times. For example, ab*c matches "ac", "abc", "abbbc", etc. [xyz]* matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on. (ab)* matches "", "ab", "abab", "ababab", and so on.
+        Matches the preceding element one or more times. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".
.        Matches any single character. For example, a.c matches "abc".
?        Matches the preceding element zero or one time. For example, ab?c matches only "ac" or "abc".
[ ]      A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z].
[^ ]     Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". As with [ ], literal characters and ranges can be mixed.
\n       Matches what the nth marked subexpression matched, where n is a digit from 1 to 9.
^        Matches the starting position within the string.
{m,n}    Matches the preceding element at least m and not more than n times. For example, a{3,5} matches only "aaa", "aaaa", and "aaaaa".
|        The choice (also known as alternation or set union) operator matches either the expression before or the expression after the operator. For example, abc|def matches "abc" or "def".
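A few of these metacharacters are shown in action below. This is a sketch on a made-up vector, not an exhaustive test. One R-specific detail that the table does not cover: a backslash inside an R string literal must itself be escaped, so the back-reference \1 is written "\\1" and a literal dot is written "\\.".

```r
# Hypothetical sample strings, used only for illustration.
x <- c("abc", "a.c", "aaa", "aaaaaa", "hello", "abcabc")

any_char  <- grepl("^a.c$", x)      # "." matches any character: "abc", "a.c"
lit_dot   <- grepl("^a\\.c$", x)    # escaped dot matches only "a.c"
interval  <- grepl("^a{3,5}$", x)   # three to five "a"s: only "aaa"
negated   <- grepl("[^a-c]", x)     # has a character outside a-c: "a.c", "hello"
doubled   <- grepl("(l)\\1", x)     # a letter "l" followed by itself: "hello"
repeated  <- grepl("^(abc)\\1$", x) # "abc" followed by the same text: "abcabc"
```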

POSIX Character Classes

Character classes are the most basic regular expression concept after complete enumeration.  POSIX class names can be used in place of the equivalent ASCII bracket expressions.  Choosing between the two is simple: use whichever is easier to read in the pattern at hand:

POSIX class   ASCII equivalent                          Description
[:alnum:]     [A-Za-z0-9]                               All letters and digits
[:alpha:]     [A-Za-z]                                  All letters
[:blank:]     [ \t]                                     Space and tab
[:cntrl:]     [\x00-\x1F\x7F]                           Control characters
[:digit:]     [0-9]                                     All digits
[:graph:]     [\x21-\x7E]                               Visible characters (printable characters except space)
[:lower:]     [a-z]                                     Lower case letters
[:upper:]     [A-Z]                                     Upper case letters
[:print:]     [\x20-\x7E]                               Visible characters and the space
[:punct:]     [][!"#$%&'()*+,./:;<=>?@\^_`{|}~-]        Punctuation characters
[:space:]     [ \t\r\n\v\f]                             Whitespace characters
[:xdigit:]    [A-Fa-f0-9]                               Hexadecimal digits

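For example, [[:upper:]ab] matches the uppercase letters and lowercase “a” and “b”. Note that a class name is only valid inside a bracket expression, which is why the pattern for a single digit is written [[:digit:]] rather than [:digit:].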
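The following sketch applies a few of the POSIX classes in R. The input vector is an assumption made up for illustration.

```r
# Hypothetical sample strings, used only for illustration.
x <- c("Hello World 42", "ab", "XYZ", "abc")

has_digit  <- grepl("[[:digit:]]", x)            # TRUE FALSE FALSE FALSE
underscore <- gsub("[[:space:]]+", "_", x[1])    # "Hello_World_42"
letters_only <- gsub("[^[:alpha:]]", "", x[1])   # "HelloWorld"

# Strings made up entirely of uppercase letters, "a", or "b".
upper_ab <- grepl("^[[:upper:]ab]+$", x)         # FALSE TRUE TRUE FALSE
```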

Character Functions

The following table lists common character functions that can be used with regular expressions:

abbreviate()                       Abbreviate vector elements
character(); as.character()        Create character vectors or coerce data to character strings
casefold()                         Change characters to all lower/upper case
charmatch(); match(); pmatch()     Match or partially match a character string
grep()                             Searches for patterns in character vectors and returns the indices of the matching elements. Very useful.
regexpr()                          Like grep(), but returns the starting position and length of the first match within each string
nchar()                            Count the number of characters in a string
paste(); paste0()                  Combine character strings (strsplit() separates them again)
substring()                        Extract part of a character string
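As a rough illustration of how these functions fit together, the sketch below runs several of them on a made-up string. All variable names are hypothetical. The last two lines combine regexpr() with substring() to pull out the matched text, using the "match.length" attribute that regexpr() attaches to its result.

```r
s <- "regular expression"   # hypothetical sample string

n_chars <- nchar(s)                     # 18
head7   <- substring(s, 1, 7)           # "regular"
shout   <- casefold(s, upper = TRUE)    # "REGULAR EXPRESSION"
joined  <- paste0("reg", "exp")         # "regexp"
as_text <- as.character(3.14)           # "3.14"

choices <- c("mean", "median")
exact   <- match("median", choices)     # 2: exact match only
partial <- pmatch("mea", choices)       # 1: unique partial match
ambig   <- pmatch("me", choices)        # NA: "me" fits both choices
ambig0  <- charmatch("me", choices)     # 0: charmatch() flags ambiguity with 0

# Locate the first word starting with "exp", then extract it.
m     <- regexpr("exp[a-z]+", s)
found <- substring(s, m, m + attr(m, "match.length") - 1)   # "expression"
```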
