In this part, we introduce regular expressions and cover their basic concepts and features.
Simple Pattern Matching
A regular expression is a description of a pattern of characters. The first thing to recognize when using regular expressions is that everything is essentially a character, and we write patterns to match a specific sequence of characters. Regular expressions work much like a regular text search, but they offer additional features for describing patterns of characters.
We can start by trying to find a simple word inside a sentence, as in the example below:
Regex:
cars
Matches:
In the example above, we created a simple pattern that matches the word "cars" inside a sentence. The regex engine went through the sentence, character by character, trying to find the sequence of characters specified in the regular expression. When it finds a match, the engine returns it.
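To try the same kind of search in a program, here is a minimal sketch using Python's built-in re module. The choice of Python and the sample sentence are assumptions made for illustration only.

```python
import re

# Hypothetical sample sentence, chosen only for this illustration.
sentence = "My neighbor collects old cars and bikes."

# re.search scans the string from left to right and returns
# the first match it finds, or None when there is no match.
match = re.search(r"cars", sentence)

if match:
    # group() is the matched text, start() is the index where it begins.
    print(match.group(), match.start())
```

The match object also tells us where the match was found, which a plain "contains" check would not.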
Note: If we want to find more than one occurrence inside the text, we can use the g flag in our regular expression, as shown in the example below:
Regex:
/the/g
Matches:
In this example, we surrounded the whole expression with two slashes (/), which mark where the pattern begins and ends so that flags can be written after it. Then, at the end of the regular expression, we added the g flag. As a result, we obtained four occurrences of the word "the".
The g flag stands for "Global", which means that the regex engine is not going to stop on the first match. It is going to continue looking globally.
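The difference between stopping at the first match and searching globally can be sketched in Python. Note that Python's re module has no g flag; as an assumption for this illustration, re.search plays the role of a non-global search and re.findall the role of a global one. The sample text is also made up.

```python
import re

# Hypothetical sample text that contains the word "the" four times.
text = "the cat chased the mouse around the house and into the garden"

# Without "global" behavior: the engine stops at the first match.
first_only = re.search(r"the", text)

# With "global" behavior: the engine keeps scanning to the end of the text.
all_matches = re.findall(r"the", text)

print(first_only.start())
print(len(all_matches))
```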
The table below lists the most common flags that can be used with our regular expressions:
Flag | Description |
---|---|
g | Global. It doesn't return after the first match. Instead, it keeps looking globally until it finds all occurrences possible. |
m | Multi line. The regex engine treats the string as multi-line text, so the ^ and $ anchors match at the start and end of each line rather than only at the start and end of the whole string. |
i | Insensitive. This option ignores case, so lowercase and uppercase characters match equally. |
s | Single line. The regex engine treats the string as a whole single line, so the dot metacharacter matches new lines as well. |
x | eXtended. Ignores whitespace inside the pattern, so the pattern can be spaced out for readability. Not every engine supports this flag. |
From now on, let's assume the g flag is used in all examples, so we don't have to specify it every time.
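As a rough illustration of the i, m, s and x flags from the table, the sketch below uses Python's re module, where these flags are spelled re.IGNORECASE, re.MULTILINE, re.DOTALL and re.VERBOSE. Using Python here is an assumption, and the sample text is hypothetical.

```python
import re

# Hypothetical two-line sample text.
text = "First line\nsecond LINE"

# i: ignore case, so "line" and "LINE" both match.
insensitive = re.findall(r"line", text, flags=re.IGNORECASE)

# m: ^ matches at the start of every line, not only at the start of the string.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# s: the dot also matches the newline, so one match can span both lines.
without_s = re.search(r"First.*second", text)
with_s = re.search(r"First.*second", text, flags=re.DOTALL)

# x: whitespace and # comments inside the pattern are ignored.
verbose = re.search(r"""
    second   # the literal word "second"
    \s       # one whitespace character
    LINE     # the literal word "LINE"
""", text, flags=re.VERBOSE)

print(insensitive, line_starts, without_s, with_s.group(), verbose.group())
```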
String literals and metacharacters
The most obvious feature of regular expressions is matching strings of one or more characters, as shown in the previous topic. Most regular expressions use normal ASCII characters, which include letters, digits, punctuation and other symbols on your keyboard such as %, #, $, @ and !. Unicode characters can also be used to match any type of international text. These are called literal characters, or just literals. They are simple characters used to form regular expressions with no special rules.
Some characters have a special meaning in regular expressions. These characters are called metacharacters, and they are used to apply rules and patterns in order to create more elaborate expressions. The most common metacharacters are the following: . ? * + { } ( ) [ ] \
Note: Metacharacters have a special function in the regular expression, which means we cannot put them freely inside the expression and expect them to match that specific character. Depending on the metacharacter, the engine either reports an error or matches something other than what we intended. To match a metacharacter literally, we must turn it into a literal character by placing a backslash (\) before it. See the example:
Regex:
Google\+
Matches:
One thing to keep in mind is that we can turn any metacharacter into a literal character just by using a backslash, including the backslash itself. So we need to pay attention to this detail when trying to understand or write regular expressions.
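The effect of escaping can be sketched in Python's re module (an assumed choice of language; the sample sentence is hypothetical). The sketch also uses re.escape, a helper from that module that adds the backslashes for us.

```python
import re

# Hypothetical sample sentence.
text = "I signed up for Google+ yesterday."

# Unescaped, + is a metacharacter meaning "one or more of the previous
# character", so the pattern matches "Google" but not the plus sign.
unescaped = re.search(r"Google+", text)

# Escaped, \+ matches a literal plus sign.
escaped = re.search(r"Google\+", text)

# re.escape builds the escaped pattern from a plain string.
auto_pattern = re.escape("Google+")

print(unescaped.group())
print(escaped.group())
print(auto_pattern)
```

In this case the unescaped pattern does not fail; it quietly matches something different, which can be harder to notice than an error.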
Matching digits
When using regular expressions, in most situations we need to declare patterns that are as generic as possible. For example, let's say we need to find a number inside a sentence, but we don't know exactly which number it is. If we specified every possible number, our pattern would become long, costly and painful to write.
So, in order to find a generic number inside a text, we can use the character shorthand \d. See the example below:
Regex:
\d
Matches:
The regular expression found two occurrences of digits inside the sentence. That's because the character shorthand \d matches any character from 0 to 9. In other words, it matches every decimal digit.
Another way to match digits is using the character class [0-9]. We are going to talk more about character shorthands and character classes later. The only thing to keep in mind now is that those two expressions do the same thing. The latter tells the regex engine to look for characters within the range from 0 to 9.
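A quick way to check that the shorthand and the character class agree is the sketch below, written with Python's re module as an assumption. In Python, \d applied to text strings also matches digits from other scripts by default, so the re.ASCII flag is passed to restrict it to 0 through 9. The sample sentence is made up.

```python
import re

# Hypothetical sample sentence with two digits.
text = "Room 4 is on floor 2."

# Shorthand; re.ASCII limits \d to the characters 0-9 in Python.
with_shorthand = re.findall(r"\d", text, flags=re.ASCII)

# Equivalent character class with an explicit range.
with_class = re.findall(r"[0-9]", text)

print(with_shorthand, with_class)
```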
One more example, but this time using a character class:
Regex:
[0-9]
Matches:
Matching words
In regular expressions, we can use the character shorthand \w to match any word character. Word characters are essentially letters, digits and the underscore. It is the same as using the character class [a-zA-Z0-9_]. For example:
Regex:
\w
Matches:
In the example above, the expression matched all word characters from the sentence.
Note: Again, we are assuming that the global flag is enabled, so the regex engine found all the occurrences of characters that match the specified class. If the flag weren't enabled, it would match only the first character, in this case the s character.
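The sketch below shows \w with and without "global" behavior, using Python's re module as an assumed stand-in (re.findall for global, re.search for first match only). The re.ASCII flag keeps \w equal to [a-zA-Z0-9_], since Python would otherwise also accept letters from other alphabets. The sample string is hypothetical.

```python
import re

# Hypothetical sample string: letters, an underscore, a space, digits, punctuation.
text = "snake_case 42!"

# All word characters: the space and the exclamation mark are skipped.
word_chars = re.findall(r"\w", text, flags=re.ASCII)

# First word character only.
first_char = re.search(r"\w", text, flags=re.ASCII)

print("".join(word_chars))
print(first_char.group())
```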
Matching whitespaces
Matching whitespace can be useful in many situations. To do that with regular expressions, we can use the character shorthand \s. It matches spaces, tabs, line feeds and carriage returns with a single expression. We could also write this using the character class [ \f\n\r\t\v].
- \f - Form feed
- \n - Line feed
- \r - Carriage Return
- \t - Tab
- \v - Vertical tab
Regex:
\s
Matches:
Notice that the regex engine matched every whitespace character between the words, simple as that.
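One common use of \s is splitting text into words regardless of which kind of whitespace separates them. Here is a minimal sketch with Python's re module (an assumption, as is the sample string):

```python
import re

# Hypothetical sample string separated by a space, a tab and a line feed.
text = "one two\tthree\nfour"

# Every whitespace character, whatever its kind.
separators = re.findall(r"\s", text)

# Splitting on \s yields the words between the separators.
words = re.split(r"\s", text)

print(separators)
print(words)
```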
Matching any character
Sometimes we need to match any kind of character in our regular expressions. We can use the metacharacter . (dot) to do that. This metacharacter is widely used, and it essentially stands for any character except line-ending characters, such as \n and \r. See the example:
Regex:
b.g
Matches:
Above, we defined a regular expression that starts with the character b, then it is followed by the metacharacter ., which matches any character, and then finishes with the character g. In that case, it found three candidates that matched that pattern.
Note: Again, if we want to match a specific metacharacter, we first need to turn it into a string literal by escaping it with a backslash. See the example:
Regex:
\.
Matches:
In this case, the dot metacharacter was converted to the string literal dot, and the regex engine was able to find the occurrences of this specific character inside the sentence.
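Both uses of the dot can be compared in a short sketch. It assumes Python's re module and a made-up sample string; exactly which line-ending characters the unescaped dot skips varies between engines, and in Python it is only \n.

```python
import re

# Hypothetical sample: three b?g words, then "b", a line feed and "g".
text = "big bug bag. b\ng."

# Unescaped dot: any character except the line feed, so "b\ng" is not matched.
any_char = re.findall(r"b.g", text)

# Escaped dot: only the literal "." character.
literal_dots = re.findall(r"\.", text)

print(any_char)
print(len(literal_dots))
```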
Character classes and Non-Character classes
As we have seen before, there is a way to define which kind of characters the regular expression must match. These definitions are called character classes.
In order to define a character class, we open a pair of brackets ([]) and put inside them the characters we want to match at that position of the regular expression. See the following example:
Regex:
gr[ae]y
Matches:
grey
groy
It is as simple as it looks. The regex first looks for the characters gr, then matches any one character listed inside the character class (a or e), and finally the character y. If none of the listed characters is found at that position, there is no match, which is what happened with the last word, groy.
For a better understanding, we can compare character classes to the logical OR operator. Using the same example, the regular expression engine accepts either a or e at that position. Any other character fails the comparison, so there is no match.
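The OR analogy can be checked with a small sketch. It assumes Python's re module, a hypothetical word list, and the alternation operator |, which is introduced here only to spell out the "either a or e" reading.

```python
import re

# Hypothetical list of candidate words.
words = ["gray", "grey", "groy"]

# Character class: exactly one character, which must be a or e.
with_class = re.compile(r"gr[ae]y")

# The same choice written as an explicit OR between two whole words.
with_alternation = re.compile(r"gray|grey")

# fullmatch requires the whole word to match the pattern.
matched_class = [w for w in words if with_class.fullmatch(w)]
matched_alternation = [w for w in words if with_alternation.fullmatch(w)]

print(matched_class, matched_alternation)
```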
We could also do the opposite using non-character classes. These are used to negate the occurrence of some characters. In order to do that, we put a circumflex (^) right after the opening bracket, before the characters we want to negate. See the example below:
Regex:
gr[^o]y
Matches:
grey
groy
It works like the previous example, except that this time we used a non-character class, which rejects the character o at that position and matches any other character. That is why grey matches and groy does not.
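A negated class still has to consume one character; it only restricts which one. The sketch below, which assumes Python's re module and a hypothetical word list, makes that visible with the word gry.

```python
import re

# Hypothetical list of candidate words.
words = ["gray", "grey", "groy", "gr8y", "gry"]

# [^o] means: exactly one character, and it must not be o.
pattern = re.compile(r"gr[^o]y")

# fullmatch requires the whole word to match the pattern.
matched = [w for w in words if pattern.fullmatch(w)]

print(matched)
```

Here gr8y is accepted because 8 is not o, while gry is rejected because there is no character between r and y for the class to match.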
Using character classes, we can define ranges as well. For example, if we want to match only digits between 3 and 7, we could write a regular expression like this:
Regex:
[3-7]
Matches:
Basically, it matches the digits 3, 4, 5, 6 and 7.
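The range can be verified against all ten digits with a short sketch (Python's re module is an assumed choice):

```python
import re

# Every decimal digit, in order.
digits = "0123456789"

# The range 3-7 inside a character class matches only 3, 4, 5, 6 and 7.
in_range = re.findall(r"[3-7]", digits)

print(in_range)
```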
Character Shorthands
Character shorthands are an easier way to specify some character classes, as we have seen when matching digits, words and whitespace.
Here is a list of the most common character shorthands:
Shorthand | Description | Equals to |
---|---|---|
\a | Alert | (ASCII 7) |
\b | Word boundary | |
\B | Non-Word boundary | |
[\b] | Backspace character | |
\d | Digit character | [0-9] |
\D | Non-Digit character | [^0-9] |
\f | Form feed character | (ASCII 12) |
\r | Carriage return | (ASCII 13) |
\n | New line | (ASCII 10) |
\s | Whitespace Character | [ \r\n\t\f\v] |
\S | Non-Whitespace Character | [^ \r\n\t\f\v] |
\t | Horizontal tab character | (ASCII 9) |
\v | Vertical tab character | (ASCII 11) |
\w | Word character | [a-zA-Z0-9_] |
\W | Non-Word character | [^a-zA-Z0-9_] |
\0 | NUL character | (ASCII 0) |
\uxxxx | Unicode value for a character (xxxx is the hexadecimal Unicode value of the character) | |
As seen in the character classes section, character shorthands can also be negated. In order to do that, we just use the uppercase version of the shorthand. For example, if we want to match anything but digits, we can use the expression below:
Regex:
\D
Matches:
Simple as that: \D matches every character that is not a digit.
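The uppercase shorthands from the table can be tried together in one sketch. It assumes Python's re module and a made-up sample string; re.sub, used below, replaces every match with the given replacement text.

```python
import re

# Hypothetical sample string mixing words, digits and punctuation.
text = "Call 555-0199 now"

# \D: every character that is not a digit.
non_digits = re.findall(r"\D", text)

# Removing all non-digits leaves only the digits.
digits_only = re.sub(r"\D", "", text)

# \W: every character that is not a word character.
non_word = re.findall(r"\W", text)

# \S+: runs of one or more non-whitespace characters.
chunks = re.findall(r"\S+", text)

print("".join(non_digits), digits_only, non_word, chunks)
```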