Regular Expression (Regex)
Searching documents for specific characters or strings has always been one of the most common repetitive tasks in information technology. You often want to replace or modify the text fragments or lines of code you’re searching for. This task becomes increasingly complex the more often the string appears in the document. In the 1950s, a solution was found in the formal languages of theoretical computer science: Regular expressions (regex) can dramatically simplify such repetitive tasks and are widely used in software development to this day.
What is a regular expression?
A regular expression (regex) is a sequence of characters that describes a regular language, which is a type of formal language. Regular expressions are a central tool in theoretical computer science, where they serve as a basis for developing and running computer programs and for constructing the compilers these require. Because they follow well-defined syntax rules, they are used primarily in software development.
For every regular expression, there is a finite automaton (also known as a state machine) that accepts the language specified by the expression. It can be built from the regular expression using Thompson’s construction algorithm. Conversely, for every finite automaton there is a regular expression that describes the language accepted by the automaton. This expression can be generated by Kleene’s algorithm or Gaussian elimination.
A state machine is a behavior model consisting of states, state transitions, and actions. It is referred to as finite if the number of states that it can assume is finite (i.e. limited).
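To make the idea of a finite automaton more concrete, here is a minimal hand-written sketch in Python for the language described by the expression ab*c. The state names, the transition table, and the accepts helper are invented for illustration; this is not the output of Thompson’s construction, just one small automaton that accepts the same strings.

```python
# Hand-written finite automaton for the language of ab*c (illustrative only).
# States: "start" -> "seen_a" -> "done"; a missing transition rejects the input.
TRANSITIONS = {
    ("start", "a"): "seen_a",
    ("seen_a", "b"): "seen_a",  # loop: any number of b's
    ("seen_a", "c"): "done",
}
ACCEPTING = {"done"}


def accepts(text: str) -> bool:
    """Return True if the whole text is in the language of ab*c."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined for this character
            return False
    return state in ACCEPTING


print(accepts("abbbc"))  # True
print(accepts("abca"))   # False
```

The automaton has three states, so it is finite in the sense described above, yet it accepts infinitely many strings because of the loop on “b.”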
A well-known IT application for regex is the search-and-replace function in text editors, which computer pioneer Ken Thompson, one of the developers of the UNIX operating system, first implemented in the line-oriented editor QED in the 1960s and later in its descendant ed. This function allows you to find specific strings in text and, if desired, replace them with any other string.
A regular expression is a string that follows defined syntax rules and describes a set of character strings. Regular expressions describe regular languages, a subgroup of formal languages that is especially important in information technology, particularly software development.
How does a regular expression work?
A regular expression can be formed by using regular characters (such as abc) only or by using a combination of regular characters and metacharacters (such as ab*c). The task of metacharacters is to describe certain character constructions or arrangements, such as whether a character should be at the beginning of the line or whether a character can or should occur exactly once or more or less frequently. The first regular expression example mentioned above works as follows:
abc: The simple regex pattern abc requires an exact match. In other words, the expression searches for all strings containing the characters “abc” in that exact order. This means the expression will match the question “Do you know your abcs?” as well as the sentence “The abcoulomb is an electromagnetic unit of charge.”
The second regular expression example works like this:
ab*c: By contrast, a regular expression with special characters works slightly differently because it searches for exact matches as well as special scenarios. In this example, the asterisk means that the expression searches for strings that begin with the letter “a” and end with the letter “c,” with any number of bs (including none) between them. As a result, “abc,” “abbbbc,” and “ac” are matches, and the string “cbbabbcba” contains a match as well (the substring “abbc”).
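The two patterns can be tried out with a few lines of Python. This is only a sketch: the article does not prescribe a language, the re module is used here for illustration, and find_match is a hypothetical helper name.

```python
import re


def find_match(pattern: str, text: str):
    """Return the first matching substring, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group() if match else None


print(find_match(r"abc", "Do you know your abcs?"))  # abc
print(find_match(r"abc", "abbbbc"))                  # None
print(find_match(r"ab*c", "abbbbc"))                 # abbbbc
print(find_match(r"ab*c", "cbbabbcba"))              # abbc
print(find_match(r"ab*c", "ac"))                     # ac
```

Note that re.search looks for the pattern anywhere in the text, which is why a longer string can contain a match without matching as a whole.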
Each regex can also be linked to a specific action such as the “replace” operation mentioned above. This action is performed wherever the regular expression is true, meaning wherever there is a match as described in the examples above.
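As a rough illustration of linking a pattern to a replace action, the following Python sketch swaps every match of ab*c for a placeholder. The sample text and the placeholder “X” are made up for this example.

```python
import re

text = "abc, abbbc and ac"

# Replace every substring that matches ab*c with the placeholder "X".
result = re.sub(r"ab*c", "X", text)

print(result)  # X, X and X
```

The word “and” is left untouched because its “a” is not followed by a “c.”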
What are the challenges of using regex?
Regex instructions give you a lot of freedom because you always have several different options for solving any problem with a regular expression. However, the ability to achieve a desired result in various ways isn’t always an advantage.
For example, you can keep the instructions very general so that you obtain the desired result in every case. But if you want to obtain the most accurate result possible, you have to form a specific regex pattern. There’s also a rule of thumb for the length: A more compact regular expression often takes less time to process, although this is not guaranteed in every case. Don’t lose sight of readability, however. If you want to change your regular expressions later, it’ll be a major obstacle if the original instructions are too complicated and, moreover, uncommented.
As a rule, when you create a regular expression, it’s important to find the optimal balance between compactness and specificity.
Which syntax rules apply to regex?
Regex can be used in a variety of languages, such as Perl, Python, Ruby, JavaScript, XML, or HTML, but their usefulness or function can differ considerably. For example, in JavaScript, regex patterns are used in the search(), match(), or replace() string methods, whereas expressions in XML documents are used to delimit element content. However, in terms of syntax, there are hardly any differences between programming languages and markup languages when it comes to regex:
A regular expression can consist of up to three parts, regardless of the language in which it is used:
Part | Description |
---|---|
Patterns | The central element is the pattern, i.e. the general search pattern. As explained in the previous section, the pattern can be composed solely of simple characters or of a combination of simple characters and special characters. |
Delimiters | Delimiters mark the beginning and end of the pattern. Basically, all non-alphanumeric characters (except backslashes) can be used as delimiters. For example, PHP supports hash signs (#pattern#), percent signs (%pattern%), plus signs (+pattern+), or tildes (~pattern~) as delimiters. However, most languages use straight quotes ("pattern") or slashes (/pattern/). |
Modifiers | Modifiers can be appended to a search pattern to modify the regular expression. For example, the modifier i makes the search case-insensitive, so that upper- and lowercase letters are treated the same. Without it, regular expressions are case-sensitive by default. |
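How the three parts look in practice depends on the language. As a sketch under that caveat, in Python the pattern is an ordinary (raw) string, so the string quotes act as delimiters, and the i modifier corresponds to the re.IGNORECASE flag or the inline form (?i). The sample text is invented.

```python
import re

text = "Regex, REGEX and regex"

# Default behaviour: the search is case-sensitive.
case_sensitive = re.findall(r"regex", text)

# The re.IGNORECASE flag plays the role of the i modifier.
with_flag = re.findall(r"regex", text, re.IGNORECASE)

# The same modifier written inline at the start of the pattern.
inline = re.findall(r"(?i)regex", text)

print(case_sensitive)  # ['regex']
print(with_flag)       # ['Regex', 'REGEX', 'regex']
print(inline)          # ['Regex', 'REGEX', 'regex']
```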
The following are typical syntax symbols used for adding specific options to patterns:
Special characters for regex syntax | Function |
---|---|
[] | A pair of square brackets denotes a character class, which always represents a single character in a search pattern. |
() | A pair of parentheses denotes a group that consists of one or more characters. Groups can be nested within one another. |
- | Specifies a range (from [...] to [...]) if it is between two regular characters inside a character class. |
^ | Limits the search to the beginning of a line (also functions as a negator at the start of a character class). |
$ | Limits the search to the end of a line. |
. | Represents any single character. |
* | The character, class, or group in front of an asterisk can occur any number of times, including zero. |
+ | The character, class, or group in front of a plus sign must occur at least once. |
? | The character, class, or group in front of a question mark is optional: it may occur once or not at all. |
\| | Indicates two or more alternatives. |
{n} | The preceding character, class, or group occurs exactly n times. |
{n,m} | The preceding character, class, or group occurs at least n times, but not more than m times. |
{n,} | The preceding character, class, or group occurs n or more times. |
\b | Matches at a word boundary (the edge of a word). |
\B | Matches only where there is no word boundary. |
\d | Any decimal digit; shorthand for character class [0-9] |
\D | Any character that is not a decimal digit; shorthand for character class [^0-9] |
\w | Any alphanumeric character or underscore; shorthand for character class [a-zA-Z_0-9] |
\W | Any character that is not alphanumeric or an underscore; shorthand for character class [^\w] |
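A few of the symbols from the table can be combined in a short Python sketch. The sample strings are invented, and the exact behavior of shorthand classes such as \d and \w can vary slightly between regex implementations, so treat the results as specific to Python’s re module.

```python
import re

line = "Order 66 shipped on 2019-10-11"

digit_runs = re.findall(r"\d+", line)    # one or more digits in a row
four_digits = re.findall(r"\d{4}", line)  # exactly four digits in a row
words = re.findall(r"\w+", "snake_case, kebab-case")  # \w includes "_" but not "-"
starts_with_order = re.search(r"^Order", line) is not None  # anchored at the start
ends_with_digit = re.search(r"\d$", line) is not None       # anchored at the end

print(digit_runs)   # ['66', '2019', '10', '11']
print(four_digits)  # ['2019']
print(words)        # ['snake_case', 'kebab', 'case']
print(starts_with_order, ends_with_digit)  # True True
```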
Tutorial: A regular expression example or two to explain the possibilities
The previous sections of this article explained the fundamentals of regex. The following tutorial uses specific regular expression examples to illustrate the possibilities and syntax tricks available for both simple and complex searches.
Single-element regex
The simplest form of regex is a search pattern that only matches a single element. As long as you’re not searching for a specific element, you can easily define a single-element regular expression using a character class. The following expression allows the digits “1,” “2,” “3,” “4,” “5,” “6” or “7” as possible matches:
[1234567]
Since the numbers are consecutive in this case, you could also use the following simplified notation:
[1-7]
If you want to change the regular expression to exclude the digit “4” from the search, you can still use the shorter range notation by combining two ranges:
[1-35-7]
Note that the individual characters of a regex pattern are not separated by spaces.
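The three character classes above can be compared directly. The following Python sketch applies each of them to the digits 0 to 9; the variable names are chosen only for this illustration.

```python
import re

digits = "0123456789"

explicit = re.findall(r"[1234567]", digits)    # every allowed digit listed
ranged = re.findall(r"[1-7]", digits)          # same set written as a range
without_four = re.findall(r"[1-35-7]", digits)  # two ranges: 1-3 and 5-7

print(explicit)      # ['1', '2', '3', '4', '5', '6', '7']
print(ranged)        # ['1', '2', '3', '4', '5', '6', '7']
print(without_four)  # ['1', '2', '3', '5', '6', '7']
```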
Multi-element regex
With a multi-element regular expression, you can also use character classes to allow for a selection of different matches. For example, if you want the expression to capture two elements for which different matches are possible, simply string together two character classes:
[1-7][a-c]
The first element, a number between “1” and “7,” is followed by one of the letters “a,” “b,” or “c.” Because regular expressions are case-sensitive by default, only lowercase letters match here. Even without using modifiers, you can include capital letters by making the following minor change to the expression:
[1-7][a-cA-C]
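A short Python sketch shows the effect of the extended character class. The candidate strings are invented, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

candidates = ["3b", "3B", "8a", "b3"]

lower_only = [c for c in candidates if re.fullmatch(r"[1-7][a-c]", c)]
both_cases = [c for c in candidates if re.fullmatch(r"[1-7][a-cA-C]", c)]

print(lower_only)  # ['3b']
print(both_cases)  # ['3b', '3B']
```

“8a” fails because 8 is outside the range, and “b3” fails because the order of the two elements matters.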
Regex with optional elements
Regardless of whether you search for multiple elements within a single regular expression or search with the help of multiple sets of characters, it’s possible that certain elements may or may not be included under certain circumstances. This could happen with a regular expression example that’s supposed to filter out all street numbers. In some cases, the street number may consist of a single digit, whereas in other matches, the number may consist of two or even three digits. Additionally, there may be addresses where a letter is added to the street number. You can capture the total set of possible combinations using the following regex instructions:
[1-9][0-9]?[0-9]?[a-z]?
The only mandatory element in this search pattern is a number between “1” and “9.” Up to two digits between “0” and “9” and one lowercase letter may follow, as indicated by the question mark after each of these elements.
The construction for three-digit numbers plus additional letters is still very clear, but it would look much different for numbers with up to ten digits. In this case, curly brackets are recommended, as in the following regular expression example:
[1-9][0-9]{0,9}
As in the previous example, the expression must start with a number between “1” and “9.” However, this number can be followed either by no digits or up to nine digits between “0” and “9.” This means that the search result can consist of up to ten digits.
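Both patterns can be checked against sample input. The following is a minimal Python sketch; the helper names and test values are assumptions made for this example, and re.fullmatch is used so that extra characters before or after the number cause a rejection.

```python
import re

STREET_NUMBER = re.compile(r"[1-9][0-9]?[0-9]?[a-z]?")
UP_TO_TEN_DIGITS = re.compile(r"[1-9][0-9]{0,9}")


def is_street_number(text: str) -> bool:
    """True if the whole text is one to three digits plus an optional letter."""
    return STREET_NUMBER.fullmatch(text) is not None


def has_up_to_ten_digits(text: str) -> bool:
    """True if the whole text is a number of one to ten digits without a leading zero."""
    return UP_TO_TEN_DIGITS.fullmatch(text) is not None


print([s for s in ["7", "42", "221b", "0", "1234", "12B"] if is_street_number(s)])
# ['7', '42', '221b']
print(has_up_to_ten_digits("1234567890"))   # True
print(has_up_to_ten_digits("12345678901"))  # False
```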
Regular expression with any number of repetitions
In the previous examples of single- and multi-element expressions, both the minimum and maximum number of characters were known. But there are also scenarios where you can’t define the number of characters in advance. In this case, the necessary metacharacters are the asterisk and the plus sign, which allow for any number of repetitions of a character, character class, or group. You can capture all strings with any number of digits (including none at all) using the following regular expression:
[0-9]*
The same applies if you’re searching for a specific combination of characters in which one (or more) characters can occur any number of times, as in the following example:
ab*
Words that contain a match include “apple,” “abnormal” and “abbey.” If you want to exclude the first one because the specified character has to occur at least once, use the plus sign instead:
ab+
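The difference between the asterisk and the plus sign shows up when you look at what is actually matched. This Python sketch uses the three words from the text; first_match is a hypothetical helper.

```python
import re

words = ["apple", "abnormal", "abbey"]


def first_match(pattern: str, text: str):
    """Return the first matching substring, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group() if match else None


star = {w: first_match(r"ab*", w) for w in words}
plus = {w: first_match(r"ab+", w) for w in words}

print(star)  # {'apple': 'a', 'abnormal': 'ab', 'abbey': 'abb'}
print(plus)  # {'apple': None, 'abnormal': 'ab', 'abbey': 'abb'}

# [0-9]* also accepts the empty string, because zero repetitions are allowed.
print(re.fullmatch(r"[0-9]*", "") is not None)  # True
```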
Negating character classes
You have to use the negator “^” (caret) if you want to use a regular expression with character classes that represent one or more characters, but you want to exclude one or more specific characters as matches. This sign is always placed directly after the opening square bracket of a character class and only applies within these brackets. The following instruction is a good example of a negated character class:
F[^u]n
In this example, the second character can be any character other than “u.” Matches would therefore include the word “Fan.” However, the word “Fun” would not be matched because its second character is the excluded “u.”
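A quick Python check illustrates the negated character class. The candidate strings are invented, and re.fullmatch is used so that each candidate has to fit the pattern completely.

```python
import re

candidates = ["Fan", "Fin", "Fun", "F n", "Fn"]

matches = [c for c in candidates if re.fullmatch(r"F[^u]n", c)]

print(matches)  # ['Fan', 'Fin', 'F n']
```

The negated class still requires exactly one character, which is why “Fn” is rejected, while a space counts as a valid “anything but u.”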
Wildcards
Regex also allows you to use wildcards that represent one, more than one or no characters within a search pattern (depending on the metacharacter you’re using). You create the wildcard using a dot combined with the above-mentioned special characters for repetitions if you want a result other than a single character. A regular expression example such as this one would allow you to search a database for a person if you know the person’s first and last name but you don’t know whether a middle name was also entered for the person:
John.*Doe
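In this case, possible matches would include “John William Doe” (as well as any other combination with a middle name) or “John W. Doe” and “John Doe.” If at least one character has to appear between the two names, use a plus sign instead of an asterisk:
John.+Doe
Because the dot also matches a space, this pattern still matches “John Doe” and only excludes strings such as “JohnDoe.” To include only variants with a middle name, write the spaces literally, as in John .+ Doe.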
The following search pattern matches both “Back” and “Buck” and is a good example of how to use a wildcard for a single character:
B.ck
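The wildcard patterns can be tested with a few sample strings. In this Python sketch, the names and words are illustrative, and re.fullmatch is used so the whole string has to fit.

```python
import re

names = ["John William Doe", "John W. Doe", "John Doe", "JohnDoe"]

with_star = [n for n in names if re.fullmatch(r"John.*Doe", n)]
with_plus = [n for n in names if re.fullmatch(r"John.+Doe", n)]

print(with_star)  # ['John William Doe', 'John W. Doe', 'John Doe', 'JohnDoe']
print(with_plus)  # ['John William Doe', 'John W. Doe', 'John Doe']

# A single dot stands for exactly one character.
single = [w for w in ["Back", "Buck", "Brick"] if re.fullmatch(r"B.ck", w)]
print(single)  # ['Back', 'Buck']
```

The space between the names counts as a character, so “John Doe” satisfies the plus sign as well.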
Alternatives
You can form a regular expression so that there are two or more alternatives for a match. The alternatives are separated with a vertical bar, as in the following example:
Tree|Flower
This expression would find matches for both “Tree” and “Flower.”
You can also use groups to form alternatives within words or strings:
(Sun|Mon|Tues|Wednes|Thurs|Fri|Satur)day
In this example, each day of the week is a potential match. The parentheses group the beginnings that differ from one weekday to the next, while the shared ending “day” only has to be written once after the group.
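Both forms of alternation can be sketched in Python as follows. The sample sentence and word list are made up for this example.

```python
import re

WEEKDAY = re.compile(r"(Sun|Mon|Tues|Wednes|Thurs|Fri|Satur)day")

found = re.findall(r"Tree|Flower", "A Tree, a Flower and a Bush")
weekday_hits = [w for w in ["Monday", "Saturday", "Mon", "Funday"] if WEEKDAY.fullmatch(w)]
stem = WEEKDAY.fullmatch("Wednesday").group(1)  # the part captured by the group

print(found)         # ['Tree', 'Flower']
print(weekday_hits)  # ['Monday', 'Saturday']
print(stem)          # Wednes
```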
Groups
Like character classes, the character groups in the example in the previous section are structural elements of regex. They are defined by a pair of parentheses and basically represent a pattern consisting of one or more characters. Strictly speaking, each regex is therefore a group, but it is not identified using parentheses in this case. Groups allow you to apply quantifiers such as the plus sign or the asterisk to a subexpression within an expression:
ab(cd)+
In this case, the desired unlimited repetition applies to the character group “cd.” Written in the same notation without parentheses, it would apply only to the “d.” There are no restrictions on the number of groups within a regex.
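The effect of the parentheses becomes visible when the grouped and ungrouped versions are compared side by side. This Python sketch uses two invented test strings.

```python
import re

grouped = re.compile(r"ab(cd)+")  # repeats the pair "cd"
ungrouped = re.compile(r"abcd+")  # repeats only the "d"

for text in ["abcdcd", "abcddd"]:
    print(
        text,
        grouped.fullmatch(text) is not None,
        ungrouped.fullmatch(text) is not None,
    )
# abcdcd True False
# abcddd False True
```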
Nested groups
A regular expression can not only contain any number of groups. It can also contain any number of nested groups in order to express complex relationships between simple characters and special characters without unnecessarily long strings. Possible matches for the regex pattern in the following example are the four car models “VW Golf,” “VW Jetta,” “Ford Explorer” or “Ford Focus”:
(VW (Golf|Jetta)|Ford (Explorer|Focus))
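In engines that capture groups, nested parentheses can also be read out individually. The following Python sketch assumes the re module, where groups are numbered by the position of their opening parenthesis; parse is a hypothetical helper.

```python
import re

CAR = re.compile(r"(VW (Golf|Jetta)|Ford (Explorer|Focus))")


def parse(text: str):
    """Return the captured groups for a full match, or None."""
    match = CAR.fullmatch(text)
    return match.groups() if match else None


print(parse("VW Golf"))     # ('VW Golf', 'Golf', None)
print(parse("Ford Focus"))  # ('Ford Focus', None, 'Focus')
print(parse("VW Focus"))    # None
```

A group that does not take part in the match is reported as None, which is how you can tell which branch of the outer alternative was used.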
Word boundaries
If you want to include word boundaries, meaning the beginning or end of an alphanumeric sequence, in a regular expression, you have to specify this with a metacharacter. Many languages use the combination “\b,” which can be added at the beginning, end or beginning and end of a search pattern.
The first option requires that the search sequence be at the beginning of the word:
\band
The word “andromeda” is included in the matches for this regular expression example. On the other hand, the word “band” is not matched because the characters being searched are preceded by the letter “b.” To flip things around, use the second option and add the special characters at the end:
and\b
Finally, with the third option, you make both word boundaries a requirement. In this example, the only possible match is the conjunction “and.”
\band\b
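The three variants can be compared on a small word list. In this Python sketch, the words are illustrative, and the patterns are written as raw strings (r"...") because “\b” inside an ordinary Python string would be read as a backspace character.

```python
import re

words = ["and", "andromeda", "band", "sandy"]


def matching(pattern: str) -> list:
    """Return the words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]


print(matching(r"\band"))    # ['and', 'andromeda']
print(matching(r"and\b"))    # ['and', 'band']
print(matching(r"\band\b"))  # ['and']
```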
Ignoring the meta-meaning of special characters
In the previous section, we used the backslash to ensure that the “b” following it was treated as a metacharacter and not as a letter. If you combine it with characters that are standard special characters for regex syntax, it has exactly the opposite effect and the character is treated as an ordinary literal. Thanks to this option, you can easily search for a specific date with a regular expression.
11\.10\.2019
In this case, the date “11.10.2019” is the only character string that matches the required search criteria. Without the backslash, the two dots would be interpreted as wildcards for any character, which is why matches such as “1101092019” or “11a10b2019” would be possible.
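The difference between the escaped and the unescaped dot can be verified in a few lines. This Python sketch also uses re.escape, a helper in the re module that adds the backslashes for you; the variable names are chosen for this example.

```python
import re

escaped = r"11\.10\.2019"
unescaped = r"11.10.2019"
candidates = ["11.10.2019", "1101092019", "11a10b2019"]

print([c for c in candidates if re.fullmatch(escaped, c)])
# ['11.10.2019']
print([c for c in candidates if re.fullmatch(unescaped, c)])
# ['11.10.2019', '1101092019', '11a10b2019']

# re.escape builds the escaped pattern from the literal text.
print(re.escape("11.10.2019") == escaped)  # True
```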
“Restricting” greedy regex
Quantifiers (“?,” “+,” “*,” “{}”) are “greedy” by default, meaning that they try to find the longest possible match. However, since this behavior isn’t always desired, you can modify quantifiers in a regular expression to make it less “greedy.” The following example illustrates this modification process:
A.*B
When applied to the string “ABCDEB,” this greedy expression would match the entire string instead of stopping after “AB.” On the other hand, if you want the search to stop as soon as the first “B” is found, you have to modify the quantifier as mentioned above. In many languages (including Perl and Tcl), you add a question mark after the quantifier for this purpose:
A.*?B
Alternatively, you can replace the original greedy expression with the following equivalent “non-greedy” expression to arrive at the same result:
A[^B]*B
Restricting greedy regex can make processing a search pattern more complicated and may increase search time.
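The three variants discussed in this section can be compared on the sample string “ABCDEB.” The following Python sketch assumes the re module, which supports the question mark notation for non-greedy quantifiers.

```python
import re

text = "ABCDEB"

greedy = re.search(r"A.*B", text).group()       # as many characters as possible
lazy = re.search(r"A.*?B", text).group()        # as few characters as possible
negated = re.search(r"A[^B]*B", text).group()   # no "B" allowed before the closing "B"

print(greedy)   # ABCDEB
print(lazy)     # AB
print(negated)  # AB
```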