Pattern Matching

Matching variables

  • $1, $2, ...
  • Used outside of a regex
  • When grouping metacharacters () are used, the parts of a string that matched are saved in matching variables
# Replace abcdef with cd
s/ab(cd)ef/$1/

Backreferences

  • \1, \2, ...
  • Used inside a regex
  • Backreferences are essentially matching variables that can be used inside a regex
# Replace abcabc with xyz
s/(abc)\1/xyz/
# Find 3-letter words that appear twice in a row, separated by a whitespace character ("the the")
/\b(\w\w\w)\s\1\b/
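
The doubled-word idea generalizes to words of any length. A minimal sketch, assuming a made-up sample string; the variable names are illustrative only:

```perl
use strict;
use warnings;

# Hypothetical sample text; the doubled words are deliberate.
my $text = "Paris in the the spring, and and so on";

# \b(\w+)\s+\1\b : a whole word, whitespace, then the same word again.
# With /g in list context, the match returns every captured word.
my @doubled = $text =~ /\b(\w+)\s+\1\b/g;

print "doubled: @doubled\n";   # doubled: the and
```

The \b anchors keep the pattern from matching inside longer words, such as the "the" shared by "bathe then".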

Matching operator

  • Find a matching string
  • //
  • m//
  • /reg-expr/
  • Operates on $_ by default
  • Can be bound to another variable using =~
  • Returns a true value if the pattern matched; otherwise, it returns false
  • Can use other delimiters when the "m" prefix is used, such as m||, m!! and m{}
  • If single quotes are used, m'', then the regex is treated as a single-quoted string (no substitutions are made)
  • In scalar context, a match /regex/ will return the value "1" upon a successful match, otherwise it will return the value ""
  • In list context, a match /regex/ with no groupings () returns the list (1) on success; with the /g modifier, it returns a list containing each matched substring
  • A match /regex/ with groupings () implicitly assigns the matched groupings to $1, $2, ....  In list context, it also returns the list ($1, $2, ...), which can be assigned to an explicit list of variables
@-, @+
  • $-[0]  : offset of the start of the entire match
  • $+[0]  : offset just past the end of the entire match
  • $-[n]  : offset of the start of the $n match
  • $+[n]  : offset just past the end of the $n match
  • If $n is undefined, so are $-[n] and $+[n]
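
To make the offsets concrete, the sketch below recovers the whole match and one group with substr. The sample string and variable names are assumptions made for illustration:

```perl
use strict;
use warnings;

my $str = "time is 12:34:56 now";   # hypothetical sample string

$str =~ /(\d\d):(\d\d):(\d\d)/ or die "no match";

# $-[0] and $+[0] bracket the entire match.
my ($start, $end) = ($-[0], $+[0]);
my $whole = substr($str, $start, $end - $start);

# $-[2] and $+[2] bracket the second group, i.e. $2.
my ($m_start, $m_end) = ($-[2], $+[2]);
my $minutes = substr($str, $m_start, $m_end - $m_start);

print "whole match '$whole' spans offsets $start to $end\n";        # '12:34:56', 8 to 16
print "group 2 '$minutes' spans offsets $m_start to $m_end\n";      # '34', 11 to 13
```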
$str =~ /orange/; # Search for "orange" in $str
$str = 'red';
m'$str'; # Matches '$str', not 'red'
# extract hours, minutes, seconds
($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

# extract hours, minutes, seconds
$time =~ /(\d\d):(\d\d):(\d\d)/; # Match hh:mm:ss format
$hours = $1;
$minutes = $2;
$seconds = $3;

# extract minutes
($minutes) = ($time =~ /\d\d:(\d\d):\d\d/);
# variables are interpolated into the regex
$foo = 'house';
'cathouse' =~ /cat$foo/;    # matches
'housecat' =~ /${foo}cat/; # matches
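
The return value of a match depends on context and on whether the regex has groupings. A short sketch comparing the cases, assuming a sample time string:

```perl
use strict;
use warnings;

my $time = "10:45:07";   # hypothetical sample value

# Scalar context: only success or failure is returned.
my $matched = $time =~ /\d\d:\d\d:\d\d/;

# List context, no groupings, no /g: the list (1) on success.
my @no_groups = $time =~ /\d\d:\d\d:\d\d/;

# List context with groupings: the captured substrings.
my @parts = $time =~ /(\d\d):(\d\d):(\d\d)/;

# List context with /g and no groupings: every matched substring.
my @pairs = $time =~ /\d\d/g;

print "scalar: $matched\n";        # scalar: 1
print "no groups: @no_groups\n";   # no groups: 1
print "groups: @parts\n";          # groups: 10 45 07
print "/g: @pairs\n";              # /g: 10 45 07
```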

Substitution operator

  • Search and replace
  • s///
  • s|||
  • s/reg-expr/replacement-string/modifiers
  • If there is a match, s/// returns the number of substitutions made, otherwise it returns false
  • Can use other delimiters, such as s!!! and s{}{}, and even s{}//
  • If single quotes are used, s''', then the regex and replacement are treated as single-quoted strings
s/apples/oranges/      # Replace the first occurrence of "apples" with "oranges"
s/cats/dogs/g # Replace all occurrences of "cats" with "dogs"
$y = "'quoted words'";
$y =~ s/^'(.*)'$/$1/; # strip single quotes; $y contains "quoted words"
# reverse all the words in a string
$x = "the cat in the hat";
$x =~ s/(\w+)/reverse $1/ge; # $x contains "eht tac ni eht tah"
# convert percentage to decimal
$x = "A 39% hit rate";
$x =~ s!(\d+)%!$1/100!e; # $x contains "A 0.39 hit rate"
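
The return value of s/// can be captured to count replacements. A minimal sketch with a made-up sample string:

```perl
use strict;
use warnings;

my $line = "cats chase cats";   # hypothetical sample string

# With /g, the return value is the number of replacements made.
my $count = ($line =~ s/cats/dogs/g);
print "$count replacements: $line\n";   # 2 replacements: dogs chase dogs

# No match: the string is left untouched and the return value is false.
my $none = ($line =~ s/birds/fish/);
print "nothing replaced\n" unless $none;
```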
Command line
  • With the -p switch, perl processes its input line by line (like sed); -i edits the file in place and -e supplies the code to run
perl -pi -e "s|apple|orange|gis;" "$file"           # apple ==> orange
perl -pi -e "s|([^q])q([^q])|\$1 \$2|gs;" "$file" # Replace each isolated q with a space (aqb ==> a b)

# Backreferences (substitution variables)
perl -pi -e "s|a(p)\1le|orange|gs;" "$file"

# Matching variables
perl -pi -e "s|a(p)ple|\1|gs;" "$file"  # \1 works in the replacement, but $1 is preferred
perl -pi -e "s|a(p)ple|\$1|gs;" "$file" # \$ keeps the shell from expanding $1

Split operator

  • split //, STRING
  • split /reg-expr/, STRING
  • The regex matches the delimiter on which the string is split
  • If the empty regex, //, is used, the string is split into individual characters
  • If the regex has groupings, then the list produced contains the matched substrings from the groupings as well
@arr = split /\s+/, "zero one two"; # Whitespace: "zero", "one", "two"
@arr = split /,\s*/, "aa,bb, cc"; # Comma-delimited: "aa", "bb", "cc"
@arr = split //, "xyz"; # Individual characters: "x", "y", "z"
@arr = split /(:)/, "10:20"; # Groupings: "10", ":", "20"
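
The three split behaviours can be checked side by side. A small sketch with assumed sample strings:

```perl
use strict;
use warnings;

# The regex matches the delimiter, including optional surrounding spaces.
my @fields = split /\s*,\s*/, "aa , bb,cc";

# The empty regex splits into individual characters.
my @chars = split //, "xyz";

# A grouping in the regex keeps the delimiter in the resulting list.
my @with_sep = split /(:)/, "10:20:30";

print join("|", @fields), "\n";     # aa|bb|cc
print join("|", @chars), "\n";      # x|y|z
print join("|", @with_sep), "\n";   # 10|:|20|:|30
```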

Modifiers

Default behavior
  • . matches any character except \n
  • ^ matches only at the beginning of the string
  • $ matches only at the end or before a newline at the end
/s
  • Treats the string as a single long line
  • . matches any character, including \n
  • ^ matches only at the beginning of the string
  • $ matches only at the end or before a newline at the end
/m
  • Treats the string as a set of multiple lines
  • . matches any character except \n
  • ^ and $ can match at the start or end of any line within the string
/sm
  • Treats the string as a single long line, but detects multiple lines
  • . matches any character, including \n
  • ^ and $ can match at the start or end of any line within the string
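
The effect of /s and /m is easiest to see on a two-line string. A minimal sketch, assuming made-up sample text; each test is reduced to 1 or 0 for printing:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";   # hypothetical two-line string

# Default: . stops at \n, so the pattern cannot span both lines.
my $dot_default = $text =~ /first.*second/  ? 1 : 0;
# /s: . also matches \n.
my $dot_s       = $text =~ /first.*second/s ? 1 : 0;

# Default: ^ matches only at the very start of the string.
my $caret_default = $text =~ /^second/  ? 1 : 0;
# /m: ^ matches at the start of every line.
my $caret_m       = $text =~ /^second/m ? 1 : 0;

# /m combined with /g: the first word of each line.
my @starts = $text =~ /^(\w+)/mg;

print "dot: $dot_default (default), $dot_s (/s)\n";         # dot: 0 (default), 1 (/s)
print "caret: $caret_default (default), $caret_m (/m)\n";   # caret: 0 (default), 1 (/m)
print "line starts: @starts\n";                             # line starts: first second
```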
/g
  • Global (applies to all occurrences of the search pattern)
/i
  • Case insensitive
/o
  • Performs variable substitutions in the regex only once (useful in loops)
/x
  • Allows extended regular expressions (improves the readability by allowing whitespace and comments to be used)
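
As an illustration of /x, the hh:mm:ss pattern can be spread over several commented lines. A sketch only; the sample value and variable names are assumptions:

```perl
use strict;
use warnings;

# Under /x, whitespace and # comments inside the pattern are ignored.
my $time_re = qr{
    (\d\d) :    # hours
    (\d\d) :    # minutes
    (\d\d)      # seconds
}x;

my ($hours, $minutes, $seconds) = "12:34:56" =~ $time_re;

print "$hours h, $minutes min, $seconds s\n";   # 12 h, 34 min, 56 s
```

To match a literal space or # under /x, escape it or place it in a character class.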

Modifiers specific to matching

/c
  • The search position on a failed match is not reset when /g is in effect
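
The difference /c makes can be observed with pos(), which reports the current search position of a /g match. A minimal sketch with an assumed sample string:

```perl
use strict;
use warnings;

my $input = "abc123";   # hypothetical sample string

# Without /c: a failed /g match resets the search position.
my $ok1 = $input =~ /[a-z]+/g;     # succeeds, position is now 3
my $ok2 = $input =~ /[A-Z]+/g;     # fails, position is reset
my $pos_without_c = pos($input);   # undef

# With /c: the position survives the failed match.
my $ok3 = $input =~ /[a-z]+/gc;    # succeeds, position is now 3
my $ok4 = $input =~ /[A-Z]+/gc;    # fails, position stays at 3
my $pos_with_c = pos($input);      # 3

# The next /g match continues from the kept position.
my $ok5 = $input =~ /(\d+)/gc;
my $digits = $1;

print "without /c: ", (defined $pos_without_c ? $pos_without_c : "undef"), "\n";   # without /c: undef
print "with /c: $pos_with_c, then matched $digits\n";                             # with /c: 3, then matched 123
```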

Modifiers specific to substitution

/e
  • Evaluates the right side as an expression