regex                  package:base                  R Documentation

_R_e_g_u_l_a_r _E_x_p_r_e_s_s_i_o_n_s _a_s _u_s_e_d _i_n _R

_D_e_s_c_r_i_p_t_i_o_n:

     This help page documents the regular expression patterns supported
     by 'grep' and related functions 'regexpr', 'sub' and 'gsub', as
     well as by 'strsplit'.

_D_e_t_a_i_l_s:

     A 'regular expression' is a pattern that describes a set of
     strings.  Three types of regular expressions are used in R:
     _extended_ regular expressions, used by 'grep(extended = TRUE)'
     (its default), _basic_ regular expressions, as used by
     'grep(extended = FALSE)', and _Perl-like_ regular expressions used
     by 'grep(perl = TRUE)'.

     Other functions which use regular expressions (often via the use
     of 'grep') include 'apropos', 'browseEnv', 'help.search',
     'list.files', 'ls' and 'strsplit'. These will all use _extended_
     regular expressions, unless 'strsplit' is called with argument
     'extended = FALSE' or 'perl = TRUE'.
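
     As a minimal sketch of two of these functions accepting a pattern,
     assuming a recent R release (the environment 'e' and the object
     names are hypothetical, chosen only for illustration):

```r
# A scratch environment holding three illustrative objects.
e <- new.env()
assign("alpha1", 1, envir = e)
assign("alpha2", 2, envir = e)
assign("beta", 3, envir = e)

# 'ls' keeps only the names matching the regular expression.
alpha_names <- ls(envir = e, pattern = "^alpha[0-9]$")

# 'strsplit' treats its second argument as a regular expression.
parts <- strsplit("a1b22c", "[0-9]+")[[1]]
```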

     Patterns are described here as they would be printed by 'cat': do
     remember that backslashes need to be doubled when entering R
     character strings from the keyboard.
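
     The doubling can be checked directly.  A small sketch, assuming a
     recent R release (the names 'pat' and 'hits' are illustrative):

```r
# The pattern \. (a literal period) is typed as "\\." in R source.
pat <- "\\."
cat(pat, "\n")          # prints the pattern as the regex engine sees it
n_chars <- nchar(pat)   # 2: one backslash and one period

# Only the first string contains a literal period.
hits <- grepl(pat, c("a.b", "axb"))
```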

_E_x_t_e_n_d_e_d _R_e_g_u_l_a_r _E_x_p_r_e_s_s_i_o_n_s:

     This section covers the regular expressions allowed if 'extended =
     TRUE' in 'grep', 'regexpr', 'sub', 'gsub' and 'strsplit'.  They
     use the GNU implementation of the POSIX 1003.2 standard.

     Regular expressions are constructed analogously to arithmetic
     expressions, by using various operators to combine smaller
     expressions.

     The fundamental building blocks are the regular expressions that
     match a single character.  Most characters, including all letters
     and digits, are regular expressions that match themselves.  Any
     metacharacter with special meaning may be quoted by preceding it
     with a backslash.  The metacharacters are '. \ | ( ) [ { ^ $ * +
     ?'.

     A _character class_ is a list of characters enclosed by '[' and
     ']'; it matches any single character in that list.  If the first
     character of the list is the caret '^', then it matches any
     character _not_ in the list.  For example, the regular expression
     '[0123456789]' matches any single digit, and '[^abc]' matches
     anything except the characters 'a', 'b' or 'c'.  A range of
     characters may be specified by giving the first and last
     characters, separated by a hyphen.  (Character ranges are
     interpreted in the collation order of the current locale.)
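
     A short sketch of lists, negated lists and ranges, assuming a
     recent R release and its default matching engine (the vector 'x'
     is illustrative):

```r
x <- c("7", "a", "d", "-")

# Explicit list of digits.
is_digit <- grepl("[0123456789]", x)

# Negated list: anything other than a, b or c.
not_abc <- grepl("[^abc]", x)

# Range from 'a' to 'c'.
in_range <- grepl("[a-c]", x)
```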

     Certain named classes of characters are predefined.  Their
     interpretation depends on the _locale_ (see locales); the
     interpretation below is that of the POSIX locale.

     '[:_a_l_n_u_m:]' Alphanumeric characters: '[:alpha:]' and '[:digit:]'.

     '[:_a_l_p_h_a:]' Alphabetic characters: '[:lower:]' and '[:upper:]'.

     '[:_b_l_a_n_k:]' Blank characters: space and tab.

     '[:_c_n_t_r_l:]' Control characters.  In ASCII, these characters have
          octal codes 000 through 037, and 177 ('DEL').  In another
          character set, these are the equivalent characters, if any.

     '[:_d_i_g_i_t:]' Digits: '0 1 2 3 4 5 6 7 8 9'.

     '[:_g_r_a_p_h:]' Graphical characters: '[:alnum:]' and '[:punct:]'.

     '[:_l_o_w_e_r:]' Lower-case letters in the current locale.

     '[:_p_r_i_n_t:]' Printable characters: '[:alnum:]', '[:punct:]' and
          space.

     '[:_p_u_n_c_t:]' Punctuation characters: '! " # $ % & ' ( ) * + , - . /
          : ; < = > ? @ [ \ ] ^ _ ` { | } ~'.

     '[:_s_p_a_c_e:]' Space characters: tab, newline, vertical tab, form
          feed, carriage return, and space.

     '[:_u_p_p_e_r:]' Upper-case letters in the current locale.

     '[:_x_d_i_g_i_t:]' Hexadecimal digits: '0 1 2 3 4 5 6 7 8 9 A B C D E F
          a b c d e f'.

     For example, '[[:alnum:]]' means '[0-9A-Za-z]', except the latter
     depends upon the locale and the character encoding, whereas the
     former is independent of locale and character set.  (Note that the
     brackets in these class names are part of the symbolic names, and
     must be included in addition to the brackets delimiting the
     bracket list.) Most metacharacters lose their special meaning
     inside lists.  To include a literal ']', place it first in the
     list.  Similarly, to include a literal '^', place it anywhere but
     first.  Finally, to include a literal '-', place it first or last.
     (Only these and '\' remain special inside character classes.)
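
     The named classes and the placement rules for ']', '^' and '-' can
     be tried as follows.  This is a sketch assuming a recent R release
     and ASCII input; the strings are made up for illustration:

```r
# Named classes must sit inside an enclosing bracket list.
no_digits  <- gsub("[[:digit:]]", "", "a1b2c3")
only_alnum <- gsub("[^[:alnum:]]", "", "R 2.0, ok!")

# A literal ']' goes first in the list.
has_bracket <- grepl("[]x]", c("a]b", "abc"))

# A literal '-' goes first or last.
no_signs <- gsub("[+-]", "", "1-2+3")

# A literal '^' goes anywhere but first.
has_caret <- grepl("[a^]", "2^3")
```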

     The period '.' matches any single character.  The symbol '\w' is
     documented to be a synonym for '[[:alnum:]]' and '\W' is its
     negation.  However, '\w' also matches underscore in the GNU grep
     code used in R.

     The caret '^' and the dollar sign '$' are metacharacters that
     respectively match the empty string at the beginning and end of a
     line.  The symbols '\<' and '\>' respectively match the empty
     string at the beginning and end of a word.  The symbol '\b'
     matches the empty string at the edge of a word, and '\B' matches
     the empty string provided it is not at the edge of a word.
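
     A sketch of the anchors and word-boundary symbols, assuming a
     recent R release and its default matching engine (the vector 'x'
     is illustrative):

```r
x <- c("cat", "a cat sat", "concatenate")

# Line anchors.
starts <- grepl("^cat", x)
ends   <- grepl("cat$", x)

# Word anchors: 'cat' must stand as a whole word.
whole_word <- grepl("\\<cat\\>", x)
edge       <- grepl("\\bcat\\b", x)
```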

     A regular expression may be followed by one of several repetition
     quantifiers:

     '?' The preceding item is optional and will be matched at most
          once.

     '*' The preceding item will be matched zero or more times.

     '+' The preceding item will be matched one or more times.

     '{_n}' The preceding item is matched exactly 'n' times.

     '{_n,}' The preceding item is matched 'n' or more times.

     '{_n,_m}' The preceding item is matched at least 'n' times, but not
          more than 'm' times.

     Repetition is greedy, so the maximal possible number of repeats is
     used.
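
     The quantifiers can be compared side by side.  A sketch assuming a
     recent R release; the test strings are illustrative and the
     patterns are anchored so that the whole string must match:

```r
x <- c("ac", "abc", "abbc", "abbbc", "abbbbc")

optional  <- grepl("^ab?c$", x)      # zero or one 'b'
any_count <- grepl("^ab*c$", x)      # zero or more
one_plus  <- grepl("^ab+c$", x)      # one or more
exactly2  <- grepl("^ab{2}c$", x)    # exactly two
two_plus  <- grepl("^ab{2,}c$", x)   # two or more
two_three <- grepl("^ab{2,3}c$", x)  # two or three

# Greedy repetition: 'b+' consumes every available 'b'.
greedy <- sub("b+", "", "abbbc")
```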

     Two regular expressions may be concatenated; the resulting regular
     expression matches any string formed by concatenating two
     substrings that respectively match the concatenated
     subexpressions.

     Two regular expressions may be joined by the infix operator '|';
     the resulting regular expression matches any string matching
     either subexpression.   For example, 'abba|cde' matches either the
     string 'abba' or the string 'cde'.  Note that alternation does not
     work inside character classes, where '|' has its literal meaning.

     Repetition takes precedence over concatenation, which in turn
     takes precedence over alternation.  A whole subexpression may be
     enclosed in parentheses to override these precedence rules.
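
     A sketch of alternation and of the precedence rules, assuming a
     recent R release (all strings are illustrative):

```r
# Alternation between two whole strings.
either <- grepl("^(abba|cde)$", c("abba", "cde", "abbacde"))

x <- c("ab", "cd", "abd", "acd")
# Without parentheses this reads as (^ab) or (cd$).
ungrouped <- grepl("^ab|cd$", x)
# Parentheses confine the alternation to 'b' or 'c'.
grouped <- grepl("^a(b|c)d$", x)

# Repetition binds tighter than concatenation.
y <- c("abb", "abab")
rep_b  <- grepl("^ab+$", y)    # 'a' then one or more 'b'
rep_ab <- grepl("^(ab)+$", y)  # one or more 'ab'

# Inside a character class '|' is an ordinary character.
pipe_literal <- grepl("[a|b]", "|")
```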

     The backreference '\N', where N is a single digit, matches the
     substring previously matched by the Nth parenthesized
     subexpression of the regular expression.
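
     A sketch of a backreference inside a pattern, assuming a recent R
     release and its default matching engine (the vector 'x' is
     illustrative):

```r
x <- c("aa", "ab", "bb", "abab")

# \1 must repeat whatever the first group actually matched.
doubled <- grepl("^(a|b)\\1$", x)

# The group may span several characters.
repeated_pair <- grepl("^(ab)\\1$", x)
```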

     The current code attempts to support traditional usage by assuming
     that '{' is not special if it would be the start of an invalid
     interval specification.  (POSIX allows this behaviour as an
     extension but we advise users not to rely on it.)

_B_a_s_i_c _R_e_g_u_l_a_r _E_x_p_r_e_s_s_i_o_n_s:

     This section covers the regular expressions allowed if 'extended =
     FALSE' in 'grep', 'regexpr', 'sub', 'gsub' and 'strsplit'.

     In basic regular expressions the metacharacters '?', '+', '{',
     '|', '(', and ')' lose their special meaning; instead use the
     backslashed versions '\?', '\+', '\{', '\|', '\(', and '\)'.
     Thus the metacharacters are '. \ [ ^ $ *'.

_P_e_r_l _R_e_g_u_l_a_r _E_x_p_r_e_s_s_i_o_n_s:

     The 'perl = TRUE' argument to 'grep', 'regexpr', 'sub', 'gsub' and
     'strsplit' switches to the PCRE library that 'implements regular
     expression pattern matching using the same syntax and semantics as
     Perl 5.6 or later, with just a few differences'.

     For complete details please consult the man pages for PCRE
     (especially 'man pcrepattern' and 'man pcreapi') on your system or
     from the sources at <URL:
     ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/>. If PCRE
     support was compiled from the sources within R, the PCRE version
     is 4.5 as described here (version >= 4.0 is required even if R is
     configured to use the system's PCRE library).

     All the regular expressions described for extended regular
     expressions are accepted except '\<' and '\>': in Perl all
     backslashed metacharacters are alphanumeric and backslashed
     symbols are always interpreted as a literal character. '{' is not
     special if it would be the start of an invalid interval
     specification.  There can be more than 9 backreferences.

     The construct '(?...)' is used for Perl extensions in a variety of
     ways depending on what immediately follows the '?'.

     Perl-like matching can work in several modes, set by the options
     '(?i)' (caseless, equivalent to Perl's '/i'), '(?m)' (multiline,
     equivalent to Perl's '/m'), '(?s)' (single line, so a dot matches
     all characters, even new lines: equivalent to Perl's '/s') and
     '(?x)' (extended, whitespace data characters are ignored unless
     escaped and comments are allowed: equivalent to Perl's '/x'). 
     These can be concatenated, so for example, '(?im)' sets caseless
     multiline matching.  It is also possible to unset these options by
     preceding the letter with a hyphen, and to combine setting and
     unsetting such as '(?im-sx)'.  These settings can be applied
     within patterns, and then apply to the remainder of the pattern.
     Additional options not in Perl include '(?U)' to set 'ungreedy'
     mode (so matching is minimal unless '?' is used, when it is
     greedy).  Initially none of these options are set.
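
     The mode options can be exercised as follows.  This is a sketch
     assuming a recent R release built with PCRE support; the subject
     strings are illustrative:

```r
# (?i): caseless matching.
caseless <- grepl("(?i)abc", "ABC", perl = TRUE)

# (?m): '^' also matches after an embedded newline.
multi_on  <- grepl("(?m)^b", "a\nb", perl = TRUE)
multi_off <- grepl("^b", "a\nb", perl = TRUE)

# (?s): the dot also matches a newline.
dotall_on  <- grepl("(?s)a.b", "a\nb", perl = TRUE)
dotall_off <- grepl("a.b", "a\nb", perl = TRUE)

# An option set mid-pattern applies only to the remainder.
partial <- grepl("a(?i)b", c("aB", "AB"), perl = TRUE)

# (?U): quantifiers become minimal by default.
ungreedy <- sub("(?U)a+", "X", "aaa", perl = TRUE)
```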

     If you want to remove the special meaning from a sequence of
     characters, you can do so by putting them between '\Q' and '\E'.
     This is different from Perl in that '$' and '@' are handled as
     literals in '\Q...\E' sequences in PCRE, whereas in Perl, '$' and
     '@' cause variable interpolation.
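
     A sketch of literal quoting, assuming a recent R release built
     with PCRE support (the strings are illustrative):

```r
# Between \Q and \E the '.' and '$' are ordinary characters.
quoted <- grepl("\\Qa.b$\\E", c("a.b$", "axb"), perl = TRUE)

# Without quoting, '.' matches any character.
unquoted <- grepl("a.b", c("a.b$", "axb"), perl = TRUE)
```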

     The escape sequences '\d', '\s' and '\w' represent any decimal
     digit, space character and 'word' character (letter, digit or
     underscore in the current locale) respectively, and their
     upper-case versions represent their negation. Unlike POSIX and
     earlier versions of Perl and PCRE, vertical tab is not regarded as
     a whitespace character.
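
     A sketch of these escapes and their negations, assuming a recent R
     release built with PCRE support and ASCII input (the strings are
     illustrative):

```r
# \D removes everything that is not a decimal digit.
digits <- gsub("\\D", "", "tel: 555-1234", perl = TRUE)

# \s+ collapses runs of spaces and tabs into one space.
squeezed <- gsub("\\s+", " ", "a \t  b", perl = TRUE)

# \W removes non-word characters; the underscore is kept.
words <- gsub("\\W", "", "x_1 + y-2", perl = TRUE)
```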

     Escape sequence '\a' is 'BEL', '\e' is 'ESC', '\f' is 'FF', '\n'
     is 'LF', '\r' is 'CR' and '\t' is 'TAB'.  In addition '\cx' is
     'cntrl-x' for any 'x', '\ddd' is the octal character 'ddd' (for up
     to three digits unless interpretable as a backreference), and
     '\xhh' specifies a character in hex.
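
     A sketch of three of these escapes, assuming a recent R release
     built with PCRE support.  Note that the backslash is doubled so
     that the escape reaches the regular expression engine rather than
     being interpreted by the R parser:

```r
# \t handled by the regex engine matches a TAB in the subject.
tab <- grepl("\\t", "a\tb", perl = TRUE)

# \x41 is the character with hexadecimal code 41, i.e. 'A'.
hex <- grepl("\\x41", c("A", "a"), perl = TRUE)

# \cI is control-I, which is also TAB.
ctrl <- grepl("\\cI", "a\tb", perl = TRUE)
```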

     Outside a character class, '\b' matches a word boundary, '\B' is
     its negation, '\A' matches at start of a subject (even in
     multiline mode, unlike '^'), '\Z' matches at end of a subject or
     before newline at end, '\z' matches at end of a subject, and '\G'
     matches at first matching position in a subject.  '\C' matches a
     single byte, including a newline.
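
     The difference between the subject anchors and '^' can be seen on
     a two-line subject.  A sketch assuming a recent R release built
     with PCRE support (the string 's' is illustrative):

```r
s <- "a\nb\n"

# In multiline mode '^' matches after the first newline ...
caret_multi <- grepl("(?m)^b", s, perl = TRUE)
# ... but \A still matches only at the very start of the subject.
start_only <- grepl("(?m)\\Ab", s, perl = TRUE)

# \Z tolerates a final newline, \z does not.
end_or_newline <- grepl("b\\Z", s, perl = TRUE)
end_strict     <- grepl("b\\z", s, perl = TRUE)
```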

     The same repetition quantifiers as extended POSIX are supported.
     However, if a quantifier is followed by '?', the match is
     'ungreedy', that is as short as possible rather than as long as
     possible (unless the meanings are reversed by the '(?U)' option).
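
     A sketch contrasting the two behaviours, assuming a recent R
     release built with PCRE support (the subject string is
     illustrative):

```r
s <- "<a><b>"

# Greedy: '.+' runs to the last '>' and the whole string is removed.
greedy <- sub("<.+>", "", s, perl = TRUE)

# Ungreedy: '.+?' stops at the first '>' it can.
lazy <- sub("<.+?>", "", s, perl = TRUE)

# Under (?U) the meanings swap, so '+?' is greedy again.
swapped <- sub("(?U)<.+?>", "", s, perl = TRUE)
```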

     The sequence '(?#' marks the start of a comment which continues up
     to the next closing parenthesis.  Nested parentheses are not
     permitted.  The characters that make up a comment play no part at
     all in the pattern matching.

     If the extended option is set, an unescaped '#' character outside
     a character class introduces a comment that continues up to the
     next newline character in the pattern.
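
     Both comment forms can be sketched as follows, assuming a recent
     R release built with PCRE support (the patterns are illustrative):

```r
# An inline (?# ... ) comment is ignored by the matcher.
inline <- grepl("a(?#just a note)b", "ab", perl = TRUE)

# With (?x), unescaped whitespace is ignored and '#' starts a
# comment that runs to the next newline in the pattern.
extended <- grepl("(?x) a b  # first two letters\n c", "abc", perl = TRUE)

# The spaces in the pattern are not matched against the subject.
no_space <- grepl("(?x) a b", "a b", perl = TRUE)
```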

     The pattern '(?:...)' groups characters just as parentheses do but
     does not make a backreference.
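
     A sketch of the effect on backreference numbering, assuming a
     recent R release built with PCRE support (the strings are
     illustrative):

```r
# With a capturing group, \1 in the replacement is the 'ab' group.
capturing <- sub("(ab)+(c)", "<\\1>", "ababc", perl = TRUE)

# With (?:...) the first numbered group is '(c)' instead.
non_capturing <- sub("(?:ab)+(c)", "<\\1>", "ababc", perl = TRUE)
```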

     Patterns '(?=...)' and '(?!...)' are zero-width positive and
     negative lookahead _assertions_: they match if an attempt to match
     the '...' forward from the current position would succeed (or
     not), but use up no characters in the string being processed.
     Patterns '(?<=...)' and '(?<!...)' are the lookbehind equivalents:
     they do not allow repetition quantifiers nor '\C' in '...'.
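
     A sketch of the four assertions, assuming a recent R release
     built with PCRE support (the strings are illustrative):

```r
x <- c("foobar", "foobaz")

# Lookahead: 'foo' followed (or not followed) by 'bar'.
ahead     <- grepl("foo(?=bar)", x, perl = TRUE)
neg_ahead <- grepl("foo(?!bar)", x, perl = TRUE)

# The assertion consumes nothing, so 'bar' survives the substitution.
zero_width <- sub("foo(?=bar)", "X", "foobar", perl = TRUE)

# Lookbehind: 'b' preceded (or not preceded) by 'a'.
behind     <- gsub("(?<=a)b", "X", "ab cb", perl = TRUE)
neg_behind <- gsub("(?<!a)b", "X", "ab cb", perl = TRUE)
```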

     Named subpatterns, atomic grouping, possessive quantifiers and
     conditional and recursive patterns are not covered here.

_A_u_t_h_o_r(_s):

     This help page is based on the documentation of GNU grep 2.4.2,
     from which the C code used by R has been taken, the 'pcre' man
     page from PCRE 3.9 and the 'pcrepattern' man page from PCRE 4.4.

_S_e_e _A_l_s_o:

     'grep', 'apropos', 'browseEnv', 'help.search', 'list.files', 'ls'
     and 'strsplit'.

