Regular expressions are used to perform string matching.
See the following tables for some common examples of regular expressions.
To specify a regular expression, add the “.REG.” operator before the
pattern.
There are a number of websites and tutorials available online.
One such site is the PerlDoc site, which can be found at:
WARNING: Regular expressions are a powerful string
matching tool. For this reason, Trend Micro recommends that Administrators
who choose to use regular expressions be familiar and comfortable
with regular expression syntax. Poorly written regular expressions
can have a dramatic negative performance impact. Trend Micro recommends
starting with simple regular expressions that do not use complex
syntax. When introducing new rules, use the archive action and observe
how the Messaging Security Agent manages messages using your rule. When
you are confident that the rule has no unexpected consequences,
you can change your action.
Regular Expression Examples
Counting and Grouping
| Element | What it Means | Example |
|---|---|---|
| . | The dot or period character represents any single character except the newline character. | do. matches doe, dog, don, dos, dot, etc. d.r matches dar, der, dor, etc. (deer and door need two dots: d..r). |
| * | The asterisk character means zero or more instances of the preceding element. | do* matches d, do, doo, dooo, doooo, etc. |
| + | The plus sign character means one or more instances of the preceding element. | do+ matches do, doo, dooo, doooo, etc., but not d. |
| ? | The question mark character means zero or one instance of the preceding element. | do?g matches dg or dog but not doog, dooog, etc. |
| ( ) | Parentheses group whatever is between them so that it is treated as a single entity. | d(eer)+ matches deer, deereer, deereereer, etc. The + sign is applied to the substring within parentheses, so the regex looks for d followed by one or more of the grouping “eer.” |
| [ ] | Square brackets indicate a set or a range of characters. | d[aeiouy]+ matches da, de, di, do, du, dy, daa, dae, dai, etc. The + sign is applied to the set within brackets, so the regex looks for d followed by one or more of any of the characters in the set [aeiouy]. d[A-Z] matches dA, dB, dC, and so on up to dZ. The set in square brackets represents the range of all upper-case letters from A to Z. |
| [ ^ ] | A caret at the start of a set in square brackets logically negates the set or range specified, meaning the regex will match any character that is not in the set or range. | d[^aeiouy] matches db, dc, dd, d9, d#, and so on: d followed by any single character except a, e, i, o, u, or y. |
| { } | Curly braces set a specific number of occurrences of the preceding element. A single value inside the braces means that only that many occurrences will match. A pair of numbers separated by a comma represents a range of valid counts of the preceding element. A single number followed by a comma means there is no upper bound. | da{3} matches daaa: d followed by 3 and only 3 occurrences of “a”. da{2,4} matches daa, daaa, and daaaa (but not daaaaa): d followed by 2, 3, or 4 occurrences of “a”. da{4,} matches daaaa, daaaaa, daaaaaa, etc.: d followed by 4 or more occurrences of “a”. |
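The counting and grouping elements above can be tried outside the product. The following is a minimal sketch in Python, assuming that Python's `re` module treats these elements the same way as the product's engine (an assumption; the “.REG.” operator itself is product-specific and is not used here). The helper names `whole_match` and `failures` are illustrative only.

```python
import re

# pattern -> (strings expected to match in full, strings expected not to match)
CASES = {
    r"do.": (["doe", "dog", "dot"], ["do", "door"]),
    r"d.r": (["dar", "dor"], ["deer"]),
    r"d..r": (["deer", "door"], ["dar"]),
    r"do*": (["d", "do", "doo"], ["dog"]),
    r"do+": (["do", "doo"], ["d"]),
    r"do?g": (["dg", "dog"], ["doog"]),
    r"d(eer)+": (["deer", "deereer"], ["d", "deerer"]),
    r"d[aeiouy]+": (["da", "dae", "dy"], ["d", "db"]),
    r"d[^aeiouy]": (["db", "d9", "d#"], ["da", "dy"]),
    r"da{3}": (["daaa"], ["daa", "daaaa"]),
    r"da{2,4}": (["daa", "daaa", "daaaa"], ["da", "daaaaa"]),
    r"da{4,}": (["daaaa", "daaaaaa"], ["daaa"]),
}


def whole_match(pattern, text):
    """Return True only when the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


def failures():
    """List every (pattern, text) pair that behaves differently from CASES."""
    bad = []
    for pattern, (should_match, should_not_match) in CASES.items():
        bad += [(pattern, s) for s in should_match if not whole_match(pattern, s)]
        bad += [(pattern, s) for s in should_not_match if whole_match(pattern, s)]
    return bad


print(failures())  # prints [] when every case behaves as listed
```

The sketch uses whole-string matching so that each example stands for exactly the listed string. A content filter typically searches for the pattern anywhere inside a message, which is a looser test.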
Character Classes (shorthand)
| Element | What it Means | Example |
|---|---|---|
| \d | Any digit character; functionally equivalent to [0-9] or [[:digit:]] | \d matches any single digit, such as 1 or 7. \d+ matches 1, 12, 123, etc., but not 1b7 as a whole: one or more of any digit characters. |
| \D | Any non-digit character; functionally equivalent to [^0-9] or [^[:digit:]] | \D matches any single non-digit character, such as a or &. \D+ matches a, ab, ab&, but not 1: one or more of any character but 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. |
| \w | Any “word” character, that is, any alphanumeric character or the underscore; functionally equivalent to [_A-Za-z0-9] or [_[:alnum:]] | \w+ matches a, ab, a1, but not !&: one or more upper- or lower-case letters, digits, or underscores, but not other punctuation or special characters. |
| \W | Any non-word character; functionally equivalent to [^_A-Za-z0-9] or [^_[:alnum:]] | \W+ matches *, &, but not ace or a1: one or more of any character other than upper- or lower-case letters, digits, and the underscore. |
| \s | Any white space character: space, new line, tab, non-breaking space, etc.; functionally equivalent to [[:space:]] | vegetable\s matches “vegetable” followed by any white space character. So the phrase “I like a vegetable in my soup” would trigger the regex, but “I like vegetables in my soup” would not. |
| \S | Any non-white space character: anything other than a space, new line, tab, non-breaking space, etc.; functionally equivalent to [^[:space:]] | vegetable\S matches “vegetable” followed by any non-white space character. So the phrase “I like vegetables in my soup” would trigger the regex, but “I like a vegetable in my soup” would not. |
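The difference between a shorthand class on its own (one character) and the same class followed by + (one or more characters) is easy to check. This is a small sketch in Python, assuming the product's shorthand classes behave like those in Python's `re` module when restricted to ASCII (the `re.ASCII` flag). The helper names `whole` and `found` are hypothetical.

```python
import re


def whole(pattern, text):
    """True when the entire string matches (ASCII-only shorthand classes)."""
    return re.fullmatch(pattern, text, flags=re.ASCII) is not None


def found(pattern, text):
    """True when the pattern occurs anywhere in the text."""
    return re.search(pattern, text, flags=re.ASCII) is not None


# A bare class consumes exactly one character; add + for "one or more".
print(whole(r"\d", "1"), whole(r"\d", "12"), whole(r"\d+", "12"))
print(whole(r"\d+", "1b7"), found(r"\d", "1b7"))

# \w includes the underscore, unlike a plain alphanumeric set.
print(whole(r"\w+", "a_1"), whole(r"[A-Za-z0-9]+", "a_1"))

# \s and \S applied to the two soup phrases.
with_space = "I like a vegetable in my soup"
with_s = "I like vegetables in my soup"
print(found(r"vegetable\s", with_space), found(r"vegetable\s", with_s))
print(found(r"vegetable\S", with_space), found(r"vegetable\S", with_s))
```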
Character Classes
| Element | What it Means | Example |
|---|---|---|
| [:alpha:] | Any alphabetic character | .REG. [[:alpha:]] matches abc, def, xxx, but not 123 or @#$. |
| [:digit:] | Any digit character; functionally equivalent to \d | .REG. [[:digit:]] matches 1, 12, 123, etc. |
| [:alnum:] | Any alphanumeric character; similar to \w, except that \w also matches the underscore | .REG. [[:alnum:]] matches abc, 123, but not ~!@. |
| [:space:] | Any white space character: space, new line, tab, non-breaking space, etc.; functionally equivalent to \s | .REG. (vegetable)[[:space:]] matches “vegetable” followed by any white space character. So the phrase “I like a vegetable in my soup” would trigger the regex, but “I like vegetables in my soup” would not. |
| [:graph:] | Any visible character, that is, any character except space, control characters, or the like | .REG. [[:graph:]] matches 123, abc, xxx, ><”, but not space or control characters. |
| [:print:] | Any printable character; similar to [:graph:] but also includes the space character | .REG. [[:print:]] matches 123, abc, xxx, ><”, and space characters. |
| [:cntrl:] | Any control character (e.g. CTRL + C, CTRL + X) | .REG. [[:cntrl:]] matches 0x03, 0x08, but not abc, 123, !@#. |
| [:blank:] | Space and tab characters | .REG. [[:blank:]] matches space and tab characters, but not 123, abc, !@# |
| [:punct:] | Punctuation characters | .REG. [[:punct:]] matches ; : ? ! ~ @ # $ % & * ‘ “ , etc., but not 123, abc |
| [:lower:] | Any lowercase alphabetic character (Note: ‘Enable case sensitive matching’ must be enabled or else it will function as [:alnum:]) | .REG. [[:lower:]] matches abc, Def, sTress, Do, etc., but not ABC, DEF, STRESS, DO, 123, !@#. |
| [:upper:] | Any uppercase alphabetic character (Note: ‘Enable case sensitive matching’ must be enabled or else it will function as [:alnum:]) | .REG. [[:upper:]] matches ABC, DEF, STRESS, DO, etc., but not abc, Def, Stress, Do, 123, !@#. |
| [:xdigit:] | Digits allowed in a hexadecimal number (0-9a-fA-F) | .REG. [[:xdigit:]] matches 0a, 7E, 0f, etc. |
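Python's built-in `re` module does not support the bracketed class names above, so testing these patterns there requires a translation step. The sketch below maps each class to an ASCII-only character set. The table `POSIX_ASCII` and the function `translate_posix` are hypothetical helpers, the ASCII ranges are an assumption about what the product matches, and only the standalone form `[[:name:]]` is handled (not combined forms such as `[^[:digit:]]`).

```python
import re
import string

# Assumed ASCII equivalents of the named classes (illustrative only).
POSIX_ASCII = {
    "alpha": "[A-Za-z]",
    "digit": "[0-9]",
    "alnum": "[A-Za-z0-9]",
    "space": r"[ \t\n\r\f\v]",
    "blank": r"[ \t]",
    "graph": r"[\x21-\x7E]",
    "print": r"[\x20-\x7E]",
    "cntrl": r"[\x00-\x1F\x7F]",
    "punct": "[" + re.escape(string.punctuation) + "]",
    "lower": "[a-z]",
    "upper": "[A-Z]",
    "xdigit": "[0-9A-Fa-f]",
}


def translate_posix(pattern):
    """Replace each standalone [[:name:]] with its assumed ASCII character set."""
    return re.sub(
        r"\[\[:([a-z]+):\]\]",
        lambda m: POSIX_ASCII[m.group(1)],
        pattern,
    )


veg = translate_posix("(vegetable)[[:space:]]")
print(bool(re.search(veg, "I like a vegetable in my soup")))
print(bool(re.search(veg, "I like vegetables in my soup")))
print(bool(re.fullmatch(translate_posix("[[:xdigit:]]+"), "7E")))
```

Each named class matches a single character, so examples such as “abc” in the table match because they contain at least one character from the class.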
Pattern Anchors
| Element | What it Means | Example |
|---|---|---|
| ^ | Indicates the beginning of a string. | ^(notwithstanding) matches any block of text that begins with “notwithstanding”. So the phrase “notwithstanding the fact that I like vegetables in my soup” would trigger the regex, but “The fact that I like vegetables in my soup notwithstanding” would not. |
| $ | Indicates the end of a string. | (notwithstanding)$ matches any block of text that ends with “notwithstanding”. So the phrase “notwithstanding the fact that I like vegetables in my soup” would not trigger the regex, but “The fact that I like vegetables in my soup notwithstanding” would. |
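The two anchor examples can be reproduced with a short Python sketch. It assumes the product's ^ and $ behave like Python's default (single-line) anchors; the variable names are illustrative.

```python
import re

STARTS_WITH = re.compile(r"^(notwithstanding)")
ENDS_WITH = re.compile(r"(notwithstanding)$")

leading = "notwithstanding the fact that I like vegetables in my soup"
trailing = "The fact that I like vegetables in my soup notwithstanding"

# search() scans the whole text, but the anchors pin the match
# to the start or the end of the string.
print(bool(STARTS_WITH.search(leading)), bool(STARTS_WITH.search(trailing)))
print(bool(ENDS_WITH.search(leading)), bool(ENDS_WITH.search(trailing)))
```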
Escape Sequences and Literal Strings
| Element | What it Means | Example |
|---|---|---|
| \ | Escapes a character that has special meaning in a regular expression (for example, “+”) so that it is matched literally. | (1) .REG. C\\C\+\+ matches ‘C\C++’. (2) .REG. \* matches *. (3) .REG. \? matches ?. |
| \t | Indicates a tab character. | (stress)\t matches any block of text that contains the substring “stress” immediately followed by a tab (ASCII 0x09) character. |
| \n | Indicates a new line character. | (stress)\n\n matches any block of text that contains the substring “stress” followed immediately by two new line (ASCII 0x0A) characters. |
| \r | Indicates a carriage return character. | (stress)\r matches any block of text that contains the substring “stress” followed immediately by one carriage return (ASCII 0x0D) character. |
| \b | Indicates a backspace character (within a character class) or denotes a word boundary (elsewhere). | (stress)[\b] matches any block of text that contains the substring “stress” followed immediately by one backspace (ASCII 0x08) character. Outside a character class, \b is a word boundary, defined as a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W. For example, the following regular expression can match a social security number: .REG. \b\d{3}-\d{2}-\d{4}\b |
| \xhh | Indicates an ASCII character with the given hexadecimal code (where hh represents any two-digit hex value). | \x7E(\w){6} matches any block of text containing a “word” of six alphanumeric characters preceded by a ~ (tilde) character. So, the words ‘~ab12cd’ and ‘~Pa3499’ would be matched, but ‘~oops’ would not. |
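The escape sequences above can be exercised with the sketch below. It assumes Perl-style semantics as implemented by Python's `re` module, where \b outside a character class is a word boundary and [\b] is a backspace; whether the product's engine agrees in every detail is an assumption. The helper `found` is hypothetical.

```python
import re


def found(pattern, text):
    """True when the pattern occurs anywhere in the text."""
    return re.search(pattern, text) is not None


# Escaping metacharacters so they are matched literally.
print(found(r"C\\C\+\+", "C\\C++"))           # the text is C\C++
print(re.escape("C\\C++") == r"C\\C\+\+")     # re.escape builds the same pattern

# Control characters.
print(found(r"(stress)\t", "stress\tfree"))
print(found(r"(stress)\n\n", "stress\n\nfree"))

# \b as a word boundary versus [\b] as a backspace.
print(found(r"(stress)\b", "stress test"), found(r"(stress)\b", "stressful"))
print(found(r"(stress)[\b]", "stress\x08"))

# \xhh: a tilde (0x7E) followed by six word characters.
print(found(r"\x7E(\w){6}", "~ab12cd"), found(r"\x7E(\w){6}", "~oops"))
```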
Regular Expression Generator
When deciding
how to configure rules for Data Loss Prevention, consider that the
regular expression generator can create only simple expressions
according to the following rules and limitations:
- Only alphanumeric characters can be variables.
- All other characters, such as [-], [/], and so on can only be constants.
- Variable ranges can only be from A-Z and 0-9; you cannot limit ranges to, say, A-D.
- Regular expressions generated by this tool are case-insensitive.
- Regular expressions generated by this tool can only make positive matches, not negative matches (“if does not match”).
- Expressions based on your sample can only match the exact same number of characters and spaces as your sample; the tool cannot generate patterns that match “one or more” of a given character or string.
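To make these limitations concrete, here is a hypothetical sketch of a generator that obeys the same rules. It is not the product's implementation: the function names `generate_pattern` and `sample_matches` and the exact output format are assumptions made for illustration.

```python
import re


def generate_pattern(sample):
    """Build a pattern from a sample string, one element per character."""
    parts = []
    for ch in sample:
        if ch.isascii() and ch.isdigit():
            parts.append("[0-9]")        # digits become a full 0-9 variable
        elif ch.isascii() and ch.isalpha():
            parts.append("[A-Z]")        # letters become a full A-Z variable
        else:
            parts.append(re.escape(ch))  # everything else stays a constant
    return "".join(parts)


def sample_matches(sample, candidate):
    """Positive, case-insensitive, exact-length match against the sample."""
    pattern = generate_pattern(sample)
    return re.fullmatch(pattern, candidate, flags=re.IGNORECASE) is not None


print(generate_pattern("AB-1234"))
print(sample_matches("AB-1234", "xy-9876"))  # same shape, different case
print(sample_matches("AB-1234", "AB-123"))   # one character short
print(sample_matches("AB-1234", "AB/1234"))  # constant does not match
```

Because every sample character maps to exactly one pattern element, the result can never express “one or more”, which mirrors the last limitation in the list.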
Complex Expression Syntax
A keyword expression
is composed of tokens, which are the smallest units used to match
the expression to the content. A token can be an operator, a logical symbol,
or an operand, that is, the argument or value on which the operator acts.
Operators
include .AND., .OR., .NOT., .NEAR., .OCCUR., .WILD., “.(.” and “ .).”
The operand and the operator must be separated by a space. An operand may
also contain several tokens. See Keywords.
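As a rough illustration of the spacing rule, the sketch below splits an expression on spaces and labels each piece as an operator or an operand, joining consecutive operand words into one operand. The function `tokenize`, the sample expression, and the grouping behaviour are assumptions for illustration; the product's actual parser may differ.

```python
# Operators named in this section, plus .REG. from the regular expression tables.
OPERATORS = {
    ".AND.", ".OR.", ".NOT.", ".NEAR.", ".OCCUR.", ".WILD.",
    ".(.", ".).", ".REG.",
}


def tokenize(expression):
    """Split on spaces; merge neighbouring operand words into one operand."""
    tokens = []
    for word in expression.split():
        if word in OPERATORS:
            tokens.append(("operator", word))
        elif tokens and tokens[-1][0] == "operand":
            # An operand may contain several space-separated words.
            tokens[-1] = ("operand", tokens[-1][1] + " " + word)
        else:
            tokens.append(("operand", word))
    return tokens


# Hypothetical expression, used only to show the tokenization.
for kind, text in tokenize("secret project .AND. .(. budget .OR. forecast .)."):
    print(kind, text)
```

Without the separating spaces, a string such as `budget.OR.forecast` would be read as a single operand in this sketch, which is why the space rule matters.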
Regular Expressions at Work
The following
example describes how the Social Security content filter, one of the
default filters, works:
[Format] .REG. \b\d{3}-\d{2}-\d{4}\b
The
above expression uses \b, a word boundary, followed by \d,
any digit, then by {x}, indicating the number of digits, and finally,
-, indicating a hyphen. This expression matches a social
security number written as three digits, two digits, and four digits
separated by hyphens. The following table describes the strings that
match the example regular expression:
Numbers Matching the Social Security Regular Expression
| .REG. \b\d{3}-\d{2}-\d{4}\b | Result |
|---|---|
| 333-22-4444 | Match |
| 333224444 | Not a match |
| 333 22 4444 | Not a match |
| 3333-22-4444 | Not a match |
| 333-22-44444 | Not a match |
If you modify the expression as follows,
[Format] .REG. \b\d{3}\x20\d{2}\x20\d{4}\b
the new expression matches the following sequence:
333 22 4444
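Both expressions can be checked against the sample strings with a short Python sketch. It assumes that Python's `re` module interprets \b, \d, {n}, and \x20 the same way as the product's engine; the names `SSN_HYPHEN`, `SSN_SPACE`, and `classify` are illustrative.

```python
import re

SSN_HYPHEN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
SSN_SPACE = re.compile(r"\b\d{3}\x20\d{2}\x20\d{4}\b")

SAMPLES = [
    "333-22-4444",
    "333224444",
    "333 22 4444",
    "3333-22-4444",
    "333-22-44444",
]


def classify(pattern, text):
    """Report whether the pattern occurs anywhere in the text."""
    return "Match" if pattern.search(text) else "Not a match"


for sample in SAMPLES:
    print(sample, "|", classify(SSN_HYPHEN, sample), "|", classify(SSN_SPACE, sample))
```

The word boundaries are what reject 3333-22-4444 and 333-22-44444: the extra digit sits directly next to another digit, so there is no boundary where the pattern requires one.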