Regex: what is it and how to get started?

Regular expressions (or simply regex) are a more efficient way of searching text or characters in any data set. They are special text strings used to create search patterns to match, find, and manage text.

By Tricent · April 9, 2021

“Is there a more powerful way to search text or characters than Ctrl+F? Or even Ctrl+H?”

— Meet Regex.

When it comes to DLP (Data Loss Prevention), using regex as custom detectors is an efficient way to find complex combinations of characters that a word list would miss.

Have a look at the following example:

[Illustrations: Before / After]

With a DLP word list, we would have had to list every misspelt variation of the word “sensitive”. With a regular expression, a single pattern covers those variations, so we can detect them faster and more precisely (see illustrations).
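The pattern used in the illustrations is not reproduced in the text, so here is a minimal Python sketch of the idea with a hypothetical pattern of our own. The name MISSPELT_SENSITIVE and the sample sentence are assumptions made for illustration, and the pattern only covers a handful of common misspellings.

```python
import re

# Hypothetical pattern (not the one from the illustrations): it tolerates a
# repeated "n", a swapped or missing vowel on either side of the "t",
# and a missing final "e". re.IGNORECASE also catches "Sensitive".
MISSPELT_SENSITIVE = re.compile(r"\bsen+s[aei]?t[aei]?ve?\b", re.IGNORECASE)

text = "This file is Sensitive, sensetive or sensitve; a sensible sensor is not."

# findall returns every non-overlapping match in the text.
matches = MISSPELT_SENSITIVE.findall(text)
print(matches)
```

One pattern picks up the three spellings of “sensitive” and skips “sensible” and “sensor”, whereas a word list would need one entry per variation.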

Regex for custom social security numbers

Google hasn’t built default detectors for different social security formats around the world. So, in this example, we’ll build one for the Danish social security number (the number printed on the “yellow health card”).

The Danish social security number consists of 10 digits. The first 6 digits represent the date of birth (day, month, and two-digit year), whilst the last 4 digits are an allocated sequence number. The two groups are usually separated by a hyphen.

In this example, we use the following regex:

/\d{6}\s{0,1}-?\d{4}/gm

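\d – matches any single digit from 0-9
{6} – repeats the preceding token exactly 6 times, so \d{6} matches 6 digits
\s – matches a whitespace character, such as a space
{0,1} – indicates that the whitespace can occur zero or one time
-? – indicates that the hyphen is optional
\d{4} – matches 4 digits
gm – flags: g (global) finds every match instead of stopping at the first one, and m (multiline) makes ^ and $ match at the start and end of each line. Neither flag makes the match case insensitive; that is the i flag.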
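To try the expression outside of regex101, here is a minimal sketch in Python. The sample numbers are made up for illustration, and CPR_PATTERN is simply a name we chose. Python has no /…/gm literal, so we assume re.findall as the equivalent of the g flag.

```python
import re

# Same pattern as above. re.ASCII restricts \d to 0-9 and \s to ASCII
# whitespace, which matches the explanation given in the list.
CPR_PATTERN = re.compile(r"\d{6}\s{0,1}-?\d{4}", re.ASCII)

# Made-up sample numbers, for illustration only.
sample = "Lone: 010190-1234\nPer: 0202851235\nIda: 030377 1236\nOrder no. 12345-678"

# findall plays the role of the g flag: it returns every match, not just the first.
found = CPR_PATTERN.findall(sample)
print(found)
```

The hyphenated, unseparated, and space-separated numbers are all found, while the order number is skipped because it does not start with 6 digits. The m flag is not needed here, since the pattern contains no ^ or $.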

With this expression, our DLP engine can detect Danish social security numbers whether they are written with a hyphen, with a space, or with no separator at all. That’s something a word list wouldn’t be able to do!

Fun fact:

Did you know that the last digit of a Danish social security number tells you the gender of the person?
Even numbers are used for females, whilst odd numbers are used for males. In this case, we know that Lone is a woman.
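As a small sketch of the even/odd rule, here it is in Python. The function name gender_from_number is hypothetical and the numbers are made up. The code relies only on the rule stated above and does not check that the number is otherwise valid.

```python
def gender_from_number(number: str) -> str:
    """Return "female" for an even last digit and "male" for an odd one."""
    # Keep only the digits, so "010190-1234" and "0101901234" behave the same.
    digits = [ch for ch in number if ch in "0123456789"]
    if len(digits) != 10:
        raise ValueError("expected a 10-digit number")
    return "female" if int(digits[-1]) % 2 == 0 else "male"


# Made-up numbers, for illustration only.
print(gender_from_number("010190-1234"))
print(gender_from_number("0101901235"))
```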

Tip:

  • Use regex101 to test your regular expressions.

Check more examples of regex at: https://support.google.com/a/answer/1371417
