Pattern Matching
Matching variables
$1, $2, ...
- Used outside of a regex
- When grouping metacharacters () are used, the parts of a string that matched are saved in matching variables
# Replace abcdef with cd
s/ab(cd)ef/$1/
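As a minimal sketch of how matching variables might be used after a match and in a replacement string (the sample date and the variable names $date, $year, $month, $day and $swapped are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $date = "2024-03-15";

my ($year, $month, $day);
if ($date =~ /(\d{4})-(\d\d)-(\d\d)/) {
    # $1, $2, $3 hold the text matched by the 1st, 2nd and 3rd groups.
    ($year, $month, $day) = ($1, $2, $3);
    print "year=$year month=$month day=$day\n";
}

# In a substitution, matching variables can be used in the replacement.
(my $swapped = $date) =~ s/(\d{4})-(\d\d)-(\d\d)/$3.$2.$1/;
print "$swapped\n";   # 15.03.2024
```

Copying $1, $2, ... into named variables right after the match is a defensive habit, since a later successful match overwrites them.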
Backreferences
\1, \2, ...
- Use inside a regex
- Backreferences are essentially matching variables that can be used inside a regex
# Replace abcabc with xyz
s/(abc)\1/xyz/
# Find a 3-letter word that is immediately repeated, separated by one whitespace character ("the the")
/\b(\w\w\w)\s\1\b/
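The following sketch generalizes the repeated-word idea to words of any length. The sample sentence and the names $text, @doubled and $fixed are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample text.
my $text = "Paris in the the spring is is nice";

# \b(\w+)\s+\1\b : a word, whitespace, then the same word again.
# \1 refers back to whatever group 1 matched, inside the same regex.
my @doubled;
while ($text =~ /\b(\w+)\s+\1\b/g) {
    push @doubled, $1;
}
print "doubled: @doubled\n";   # doubled: the is

# Collapse each doubled word to a single copy.
(my $fixed = $text) =~ s/\b(\w+)\s+\1\b/$1/g;
print "$fixed\n";              # Paris in the spring is nice
```

Note the division of labor: \1 is used inside the regex, while $1 is used in the replacement string and in the code after the match.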
Matching operator
- Find a matching string
//
m//
/reg-expr/
- Operates on $_ by default
- Can be bound to another variable using =~
- Returns a true value if the pattern matched; otherwise, it returns false
- Can use other delimiters when the "m" prefix is used, such as m||, m!! and m{}
- If single quotes are used, m'', then the regex is treated as a single-quoted string (no substitutions are made)
- In scalar context, a match /regex/ returns the value "1" upon a successful match; otherwise it returns the value ""
- In list context, a match /regex/ with no groupings () returns the list (1) on success; with the /g modifier, it returns a list containing each matched substring
- A match /regex/ with groupings () implicitly assigns each grouping of matched values to the matching variables $1, $2, ... In list context, the groupings are also returned as the list ($1, $2, ...), which can be assigned explicitly to a stated list
@-, @+
$-[0] : offset of the start of the entire match
$+[0] : offset of the end of the entire match (one past its last character)
$-[n] : offset of the start of the $n match
$+[n] : offset of the end of the $n match
- If $n is undefined, so are $-[n] and $+[n]
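A small sketch of how the offsets in @- and @+ might be used to recover matched text with substr. The sample string and the names $str, $whole, $minutes and @offsets are hypothetical.

```perl
use strict;
use warnings;

my $str = "time 12:34:56 end";   # hypothetical sample string

my ($whole, $minutes, @offsets);
if ($str =~ /(\d\d):(\d\d):(\d\d)/) {
    # Save the offsets first; a later match would overwrite them.
    @offsets = ($-[0], $+[0], $-[2], $+[2]);
    # Offsets 0 span the whole match; substr() recovers it.
    $whole = substr($str, $offsets[0], $offsets[1] - $offsets[0]);
    # Offsets 2 span the second group ($2).
    $minutes = substr($str, $offsets[2], $offsets[3] - $offsets[2]);
    printf "whole='%s' spans %d..%d\n", $whole, $offsets[0], $offsets[1];
    printf "minutes='%s' spans %d..%d\n", $minutes, $offsets[2], $offsets[3];
}
```

Because the end offset points one past the last matched character, end minus start gives the length of the match.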
$str =~ /orange/ # Search for "orange" in $str
$str = 'red';
m'$str'; # $str is not interpolated, so the pattern does not become 'red'
# extract hours, minutes, seconds
($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
# extract hours, minutes, seconds using the matching variables
$time =~ /(\d\d):(\d\d):(\d\d)/; # Match hh:mm:ss format
$hours = $1;
$minutes = $2;
$seconds = $3;
# extract minutes
($minutes) = ($time =~ /\d\d:(\d\d):\d\d/);
$foo = 'house';
'cathouse' =~ /cat$foo/; # matches
'housecat' =~ /${foo}cat/; # matches
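The return values of the matching operator in scalar and list context can be compared side by side. This is a sketch only; the sample string and the names $ok, $bad, @plain, @all and @parts are assumptions.

```perl
use strict;
use warnings;

my $time = "12:34:56";   # hypothetical sample string

# Scalar context: 1 on success, "" (empty string) on failure.
my $ok  = ($time =~ /\d\d:\d\d/);
my $bad = ($time =~ /xyz/);

# List context, no groupings, no /g: the list (1) on success.
my @plain = ($time =~ /\d\d/);
# List context, no groupings, with /g: every matched substring.
my @all = ($time =~ /\d\d/g);
# List context with groupings: the captured substrings ($1, $2, $3).
my @parts = ($time =~ /(\d\d):(\d\d):(\d\d)/);

print "ok='$ok' bad='$bad'\n";
print "plain=(@plain) all=(@all) parts=(@parts)\n";
```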
Substitution operator
- Search and replace
s///
s|||
s/reg-expr/replacement-string/modifiers
- If there is a match, s/// returns the number of substitutions made; otherwise it returns false
- Can use other delimiters, such as s!!! and s{}{}, and even s{}//. If single quotes are used, s''', then the regex and replacement are treated as single-quoted strings
s/apples/oranges/ # Replace the first occurrence of "apples" with "oranges"
s/cats/dogs/g # Replace all occurrences of "cats" with "dogs"
$y = "'quoted words'";
$y =~ s/^'(.*)'$/$1/; # strip single quotes; $y contains "quoted words"
# reverse all the words in a string
$x = "the cat in the hat";
$x =~ s/(\w+)/reverse $1/ge; # $x contains "eht tac ni eht tah"
# convert percentage to decimal
$x = "A 39% hit rate";
$x =~ s!(\d+)%!$1/100!e; # $x contains "A 0.39 hit rate"
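The return value and the alternative delimiters described above can be sketched as follows. The sample strings and the names $text, $count, $none and $path are made up for illustration.

```perl
use strict;
use warnings;

my $text = "cats chase cats";   # hypothetical sample string

# s///g returns the number of substitutions made.
my $count = ($text =~ s/cats/dogs/g);
print "$count replaced: $text\n";      # 2 replaced: dogs chase dogs

# With no match, the return value is false and the string is unchanged.
my $none = ($text =~ s/birds/fish/);
print "nothing replaced\n" unless $none;

# Alternative delimiters avoid escaping slashes in paths.
my $path = "/usr/local/bin";
$path =~ s{/usr/local}{/opt};
print "$path\n";                        # /opt/bin
```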
Command line
- With the -p switch, perl processes the input line by line (like sed); -i edits the file in place, and -e supplies the code to run
perl -pi -e "s|apple|orange|gis;" "$file" # apple ==> orange
perl -pi -e "s|([^q])q([^q])|\$1 \$2|gs;" "$file" # Replace isolated q's with a space (aqb ==> a b)
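# Backreference inside the regex (apple ==> orange)
perl -pi -e "s|a(p)\1le|orange|gs;" "$file"
# Matching variables in the replacement (apple ==> p)
perl -pi -e "s|a(p)ple|\1|gs;" "$file" # \1 works here, but \$1 is the preferred form
perl -pi -e "s|a(p)ple|\$1|gs;" "$file" # \$ stops the shell from expanding $1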
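Line-by-line processing can also be written as an explicit loop in a script. The sketch below only approximates the read, edit and print cycle; it reads from an in-memory filehandle rather than a real file, and the names $input, $in, $line and $output are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input, read from an in-memory filehandle instead of a file.
my $input = "An apple a day\nAPPLE pie\nno fruit here\n";
open(my $in, '<', \$input) or die "cannot open in-memory file: $!";

my $output = '';
while (my $line = <$in>) {
    # A case-insensitive, global substitution applied to each line in turn.
    $line =~ s|apple|orange|gi;
    $output .= $line;          # a line-by-line filter would print here
}
close($in);
print $output;
```

Because each line is handled separately, a pattern in such a loop cannot match text that spans a line break.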
Split operator
split //, STRING
split /reg-expr/, STRING
- The regex determines the character sequence that the string is split on
- If the empty regex, //, is used, the string is split into individual characters
- If the regex has groupings, then the list produced contains the matched substrings from the groupings as well
@arr = split /\s+/, "zero one two"; # Whitespace: "zero", "one", "two"
@arr = split /,\s*/, "aa,bb, cc"; # Comma-delimited: "aa", "bb", "cc"
@arr = split //, "xyz"; # Individual characters: "x", "y", "z"
@arr = split /(:)/, "10:20"; # Groupings: "10", ":", "20"
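The three behaviors of split can be checked with a short script. This is a sketch with made-up input strings; @fields, @chars and @tokens are hypothetical names.

```perl
use strict;
use warnings;

# Split on a regex: the matched separators are discarded.
my @fields = split /\s*;\s*/, "alpha ; beta;gamma";
# Empty regex: one element per character.
my @chars = split //, "abc";
# Grouping in the regex: the captured separators are kept in the list.
my @tokens = split /([+-])/, "1+2-3";

print join('|', @fields), "\n";   # alpha|beta|gamma
print join('|', @chars),  "\n";   # a|b|c
print join('|', @tokens), "\n";   # 1|+|2|-|3
```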
Modifiers
Default behavior
. matches any character except \n
^ matches only at the beginning of the string
$ matches only at the end or before a newline at the end
/s
- Treats the string as a single long line
. matches any character, including \n
^ matches only at the beginning of the string
$ matches only at the end or before a newline at the end
/m
- Treats the string as a set of multiple lines
. matches any character except \n
^ and $ can match at the start or end of any line within the string
/sm
- Treats the string as a single long line, but detects multiple lines
. matches any character, including \n
^ and $ can match at the start or end of any line within the string
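The effect of /s and /m on ., ^ and $ can be observed on a two-line string. This is a sketch; the sample text and all variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";   # hypothetical two-line string

# Without /s, . cannot match the \n between the two lines.
my $dot_default = ($text =~ /line.second/)  ? 1 : 0;   # 0
my $dot_s       = ($text =~ /line.second/s) ? 1 : 0;   # 1

# Without /m, ^ matches only at the start of the string.
my @starts_default = ($text =~ /^(\w+)/g);    # ("first")
my @starts_m       = ($text =~ /^(\w+)/mg);   # ("first", "second")

# With /sm, .* can cross the newline and ^ can match after it.
my $both = ($text =~ /^first.*^second/sm) ? 1 : 0;     # 1

print "dot: default=$dot_default s=$dot_s\n";
print "starts: default=(@starts_default) m=(@starts_m)\n";
print "sm: $both\n";
```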
/g
- Global (applies to all occurrences of the search pattern)
/i
- Ignores case when matching
/o
- Performs variable substitutions in the regex only once (useful in
loops)
/x
- Allows extended regular expressions (improves the readability by
allowing whitespace and comments to be used)
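As an illustration of /x, the time-matching pattern might be laid out over several lines with a comment per part. This is a sketch; the sample string and the names $time, $hh, $mm and $ss are hypothetical.

```perl
use strict;
use warnings;

my $time = "09:05:30";   # hypothetical sample string

# Under /x, unescaped whitespace in the pattern is ignored and
# "#" starts a comment that runs to the end of the line.
my ($hh, $mm, $ss) = $time =~ /
    ^ (\d\d)      # hours
    : (\d\d)      # minutes
    : (\d\d) \z   # seconds, then end of string
/x;

print "hh=$hh mm=$mm ss=$ss\n";   # hh=09 mm=05 ss=30
```

To match a literal space or # under /x, escape it with a backslash or place it in a character class.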
Modifiers specific to matching
/c
- The search position on a failed match is not reset when /g is in
effect
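The difference /c makes can be observed through pos(), which reports the current search position of a string. This is a sketch; the sample string and the names $str, $found and @log are made up for illustration.

```perl
use strict;
use warnings;

my $str = "aaa bbb";   # hypothetical sample string
my ($found, @log);

$found = $str =~ /a+/g;            # matches "aaa"; pos($str) is now 3
push @log, pos($str);

$found = $str =~ /x/g;             # fails without /c: pos is reset
push @log, defined(pos($str)) ? pos($str) : 'undef';

$found = $str =~ /a+/g;            # matches "aaa" again from the start
$found = $str =~ /x/gc;            # fails with /c: pos is kept at 3
push @log, pos($str);

print "@log\n";   # 3 undef 3
```

Keeping the position after a failed match lets several patterns be tried in turn at the same point, which is handy for simple tokenizers.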
Modifiers specific to substitution
/e
- Evaluates the right side as an expression