A guide to regular expressions

8 March 2020

Note: a good place to try out the examples in this article and experiment with regexes is Regex101.

Introduction

Regular expressions (or ‘regexes’) are a kind of code you can use to target certain words or characters in some text.

For example, if you want to find all instances of the word “JavaScript”, you could use this regular expression:

/JavaScript/

At its most basic, a regex is just some characters between two forward slashes.

Some of the JavaScript methods that you can use with regular expressions are:

test
search
replace
match
split
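
As a rough sketch of what each of these methods does, here they are applied to one made-up sentence (the sample text and variable names are my own, and the g flag used in two of the calls is explained in the Flags section):

```javascript
// Illustrative sample text, not taken from a real source.
const text = "I love JavaScript. JavaScript is fun.";

// test: returns true or false depending on whether there is a match.
const hasMatch = /JavaScript/.test(text); // true

// search: returns the index of the first match, or -1 if there is none.
const firstIndex = text.search(/JavaScript/); // 7

// replace: swaps matches for another string (every match, because of the g flag).
const replaced = text.replace(/JavaScript/g, "JS"); // "I love JS. JS is fun."

// match: with the g flag, returns an array of every match.
const matches = text.match(/JavaScript/g); // ["JavaScript", "JavaScript"]

// split: breaks the string apart wherever the regex matches (here, at each space).
const words = text.split(/ /); // ["I", "love", "JavaScript.", "JavaScript", "is", "fun."]
```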

You could use the test method if you needed to filter news article headlines for just those related to a certain topic:

const headlines = [
  "No 10 baby on the way",
  "No-go zones and 'volunteer army' to fight virus",
  "Cash for flood defenses to be doubled",
  "Are flying taxis ready for lift-off?",
];

const regex = /virus/;

const filteredHeadlines = headlines.filter((headline) => regex.test(headline));

The test method returns true or false depending on whether the string contains a match for the regex, so the filteredHeadlines array would contain only headlines with the word ‘virus’. (If you’re simply checking whether a string contains a certain word, like in this example, you could also use the String.includes method.)

While we can use normal letters in regular expressions, as above, this on its own would be quite limiting. Regular expressions are powerful because we can be more general or more specific.

Most useful regex symbols

\d - this matches any digit (i.e. number)

\w - this matches any ‘word’ character (slightly misleading because it actually matches any letter, number or underscore)

\s - whitespace (space/tab/return/new line - use an actual space if you don’t want to match these other types)

For the opposite of these, just capitalise the letter:

\D - anything that isn’t a number

\W - anything that isn’t a ‘word’ character (i.e. not a letter, number or underscore)

\S - anything that isn’t whitespace

The backslash is necessary because, if you didn’t include it, it would just mean the letter d or the letter w or the letter s.
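
To make these symbols concrete, here is a small sketch that picks out the first character each one matches in a made-up string (the string and variable names are just for illustration):

```javascript
// Illustrative sample string.
const sample = "Room 42";

// match without any flags returns an array whose first item is the first match.
const firstDigit = sample.match(/\d/)[0]; // "4"
const firstWordChar = sample.match(/\w/)[0]; // "R"
const firstSpace = sample.match(/\s/)[0]; // " "

// The capitalised versions match the opposite.
const firstNonDigit = sample.match(/\D/)[0]; // "R"
const firstNonWord = sample.match(/\W/)[0]; // " "
const firstNonSpace = sample.match(/\S/)[0]; // "R"

// Without the backslash, d is just the letter d, which "Room 42" does not contain.
const hasLetterD = /d/.test(sample); // false
```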

Flags

At the end of the regex, after the last forward slash, we can add flags. These are the most important to know:

g - this stands for ‘global’ and means it will find all matches (it only finds the first instance by default so you’ll usually use this)

i - this stands for ‘case insensitive’ (you’ll generally want this as well)
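
Here is a quick sketch of how the two flags change the result of match and replace, using a made-up sentence:

```javascript
// Illustrative sample text.
const text = "Regex is fun. I use regex daily.";

// g only: case still matters, so the capitalised "Regex" is skipped.
const caseSensitive = text.match(/regex/g); // ["regex"]

// g and i together: every match, regardless of case.
const caseInsensitive = text.match(/regex/gi); // ["Regex", "regex"]

// i only: replace stops after the first match.
const firstReplaced = text.replace(/regex/i, "RegExp");
// "RegExp is fun. I use regex daily."

// g and i together: replace changes every match.
const allReplaced = text.replace(/regex/gi, "RegExp");
// "RegExp is fun. I use RegExp daily."
```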

One or more

So if we wanted to extract a number from a string, we could do it like this:

const string = "I was born in 1988";

const regex = /\d/g;

const extractedNumbers = string.match(regex);

The match method returns anything that matches inside an array, so we get [“1”, “9”, “8”, “8”].

But what if we don’t want the numbers to be separated like this? Well, we can add a symbol that tells the regex to match not just single numbers, but sequences of numbers of any length. We can do this with the plus symbol (+) which means ‘one or more of the preceding character’:

const regex = /\d+/g;

With this regex, the match method would now return [“1988”].

This many

What if we wanted to extract only years from the string, for example? The above regex would match any sequence of numbers, including phone numbers. Years are usually 4 digits long, so we can tell it to match exactly 4 digits in a row:

const regex = /\d{4}/;

If we wanted to match phone numbers and we know that phone numbers can have between 9 and 11 digits, we can specify the range like this:

const regex = /\d{9,11}/;

If you know the minimum number of digits there should be but you don’t want to put a maximum limit, just leave out the second number (but leave in the comma):

const regex = /\d{9,}/;
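
As a quick sketch, here are the three forms applied to some made-up strings (the phone number and ID are invented for illustration). One thing worth noticing in the last line: on its own, \d{4} will also match four digits at a time inside a longer run of digits.

```javascript
// Exactly 4 digits in a row.
const years = "I was born in 1988".match(/\d{4}/g); // ["1988"]

// Between 9 and 11 digits in a row.
const phones = "Call 07700900123 today".match(/\d{9,11}/g); // ["07700900123"]

// 9 or more digits in a row, with no upper limit.
const longNumbers = "ID 123456789012345".match(/\d{9,}/g); // ["123456789012345"]

// Caveat: \d{4} also matches in chunks of four inside the 11-digit number.
const chunks = "Call 07700900123 today".match(/\d{4}/g); // ["0770", "0900"]
```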

Any of these

If you had an input field on your website for a name, you could use a regular expression to validate it. Most people’s names consist of two sequences of letters, so this regex would cover all names of that type:

/[a-z]+\s[a-z]+/gi

Here the square brackets mean ‘any of these’. So, in this case, the square brackets with the plus sign mean a sequence of one or more letters from a to z.

So this regex would be able to capture my name (Tom McAndrew), but what about names with apostrophes or hyphens like Bill O’Reilly or Daniel Day-Lewis? We can fix this quite easily by including a hyphen and an apostrophe within both pairs of square brackets (as these characters could also appear in a first name).

/[a-z'-]+\s[a-z'-]+/gi
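
To see the difference between the two versions, here is a small sketch. The variable names are my own, the names are typed with a straight apostrophe ('), and I have left off the g flag because we only need the first match in each string:

```javascript
// The first version: letters only.
const lettersOnly = /[a-z]+\s[a-z]+/i;

// The updated version: letters, apostrophes and hyphens.
const withPunctuation = /[a-z'-]+\s[a-z'-]+/i;

// The letters-only regex stops at the hyphen.
const partialMatch = "Daniel Day-Lewis".match(lettersOnly)[0]; // "Daniel Day"

// The updated regex captures the whole name.
const fullMatch = "Daniel Day-Lewis".match(withPunctuation)[0]; // "Daniel Day-Lewis"
const apostropheMatch = "Bill O'Reilly".match(withPunctuation)[0]; // "Bill O'Reilly"

// A single word has no whitespace followed by a second part, so it fails.
const singleWord = withPunctuation.test("Cher"); // false
```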

None of these

Another very useful regex symbol is ^. When it’s used inside square brackets, it means ‘none of these’. For example:

/[^f]ork/gi

This says, match any character, except ‘f’, followed by ‘ork’. So this would match ‘pork’, ‘york’ and ‘cork’ but not ‘fork’.
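
A minimal sketch of this, filtering a small made-up list of words:

```javascript
// Illustrative word list.
const words = ["pork", "york", "cork", "fork"];

// Keep only the words where 'ork' is preceded by something other than 'f'.
const notFork = words.filter((word) => /[^f]ork/i.test(word));
// ["pork", "york", "cork"]
```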

Optional characters

We can say that a character or set of characters is optional by putting a question mark (?) after it.

For example, if we wanted to match the word ‘colour’ in a text but we know it can be spelled with or without a ‘u’, we could do it with this:

/colou?r/gi
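
For instance, running it over a made-up sentence that contains both spellings:

```javascript
// Illustrative sample text with both spellings.
const text = "Colour or color? My favourite color is blue.";

// The 'u' is optional, so both spellings are found.
const spellings = text.match(/colou?r/gi);
// ["Colour", "color", "color"]
```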

Zero or more

We saw earlier that the plus sign means ‘one or more’ of the preceding character. If we want to say ‘zero or more’ of the preceding character, we would use the asterisk (*):

/go*gle/gi

This would match:

ggle
gogle
google
gooogle
goooogle

and so on.

This could be useful if you wanted a search function to work even if the user misspelled the search term.
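
As a sketch of that idea, here is the regex used to filter a made-up list of search terms (the list and variable names are only for illustration):

```javascript
// Illustrative search terms, some misspelled.
const searchTerms = ["ggle", "gogle", "google", "gooogle", "gargle"];

// Any number of o's (including none) is allowed between the first two g's.
const looksLikeGoogle = searchTerms.filter((term) => /go*gle/i.test(term));
// ["ggle", "gogle", "google", "gooogle"]
```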

Wildcard

The dot (.) is a ‘wildcard’ and can represent any character (except, by default, a line break).

/.+\.js/g

This would match any sequence of characters followed by ‘.js’, so it would capture all JavaScript files. Note that we ‘escape’ the second dot with a backslash as it represents a literal dot and not the wildcard.
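
Here is a rough sketch using some made-up file names. It also shows two things to watch out for: an unescaped dot matches any character, and because nothing in the regex says where the string must end, a name like notes.json matches too (the $ symbol covered in the next section deals with that).

```javascript
// Illustrative file names.
const files = ["index.js", "styles.css", "app.min.js", ".js", "notes.json"];

// One or more of any character, then a literal ".js".
const matched = files.filter((file) => /.+\.js/.test(file));
// ["index.js", "app.min.js", "notes.json"]
// ".js" is left out because .+ needs at least one character before the dot.

// With the dot escaped, it has to be a real dot.
const escapedDot = /.+\.js/.test("notesxjs"); // false

// With the dot left unescaped, the wildcard happily matches the "x".
const unescapedDot = /.+.js/.test("notesxjs"); // true
```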

Start/end of string

If you remember from before, when used inside square brackets, the ^ symbol means ‘none of these characters’. However, when not inside square brackets, it’s used to specify that our regex pattern must appear at the very start of a string. Similarly, to specify that something must be at the end of a string, we use the dollar sign ($).

For example, if you wanted to match a phone number and wanted to make sure it had no spaces or letters or anything else before or after it, you could use this:

/^\d{9,11}$/g
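
A small sketch of this check in action, with made-up phone numbers. Note that I have left off the g flag in the first regex: when a regex with the g flag is reused with the test method, it remembers where the last match ended, which can give a surprising false on the next call (shown at the bottom).

```javascript
// Same pattern as above, but without the g flag.
const phoneRegex = /^\d{9,11}$/;

const valid = phoneRegex.test("07700900123"); // true
const withSpace = phoneRegex.test("07700 900123"); // false (space in the middle)
const withLetters = phoneRegex.test("tel07700900123"); // false (letters at the start)
const tooShort = phoneRegex.test("12345678"); // false (only 8 digits)

// With the g flag, the second call starts searching from the end of the
// previous match, so the same string fails the second time.
const globalPhoneRegex = /^\d{9,11}$/g;
const firstCall = globalPhoneRegex.test("07700900123"); // true
const secondCall = globalPhoneRegex.test("07700900123"); // false
```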

Lookaheads & lookbehinds

Finally, we can also specify in a regex that a pattern must appear directly before or directly after something else. For these situations we use either a ‘lookahead’ or a ‘lookbehind’. Imagine we have a list of names:

const names = [
  "Rob Harrison",
  "Brenda Grayson",
  "Frank Sheridan",
  "Molly O'Keefe",
  "Simon Harrison",
];

If we wanted to extract the first names of people with the surname ‘Harrison’, a lookahead would be very useful.

The syntax of a lookahead is a question mark and an equals sign surrounded by parentheses:

(?= regex goes here)

So our full regex will say that there must be a sequence of letters followed by a space and then the word ‘Harrison’:

const regex = /[a-z]+(?= harrison)/gi;

We can create an array of the first names that match like this:

let firstNames = [];

names.forEach((name) => {
  if (name.match(regex)) {
    firstNames = [...firstNames, ...name.match(regex)];
  }
});
// -> ["Rob", "Simon"]

If we wanted to get the first names of everyone who does NOT have the surname Harrison, we can use a negative lookahead. The only difference in the syntax is that we use an exclamation mark (!) instead of an equals sign:

const regex = /[a-z]+(?= [a-z])(?! harrison)/gi;

This matches any sequence of letters which is followed by a space and another sequence of letters, but which is NOT followed by a space and the word ‘Harrison’.
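
Here is one way this could be run against the names array from above. This is a sketch: flatMap is my own choice for collecting the results, and the array is repeated so the snippet runs on its own.

```javascript
const names = [
  "Rob Harrison",
  "Brenda Grayson",
  "Frank Sheridan",
  "Molly O'Keefe",
  "Simon Harrison",
];

const regex = /[a-z]+(?= [a-z])(?! harrison)/gi;

// match returns null when nothing matches, so fall back to an empty array.
const notHarrisons = names.flatMap((name) => name.match(regex) || []);
// ["Brenda", "Frank", "Molly"]
```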

Lookbehinds are very similar. We just need to include one additional symbol - the ‘less than’ symbol (<):

(?<= regex goes here)

To get the surnames of everyone called Frank in our names array above, we can use this regex which matches any sequence of letters that follow the word ‘Frank’ and a space:

/(?<=frank )[a-z]+/gi

To make it negative, change the equals sign to an exclamation mark:

/(?<!frank )(?<= )[a-z]+/gi

This matches any surname that does not come after the word ‘Frank’ and a space BUT does come after one space.
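
As a sketch, here are both lookbehinds run against the names array (repeated here so the snippet runs on its own). Because [a-z] does not include the apostrophe, the negative version only captures the ‘O’ of O'Keefe; the last variant, which adds the apostrophe and hyphen to the square brackets as we did in the names example, is my own adjustment.

```javascript
const names = [
  "Rob Harrison",
  "Brenda Grayson",
  "Frank Sheridan",
  "Molly O'Keefe",
  "Simon Harrison",
];

// Helper (hypothetical name) that collects every match across the array.
const collect = (regex) => names.flatMap((name) => name.match(regex) || []);

// Positive lookbehind: surnames that come after "Frank ".
const frankSurnames = collect(/(?<=frank )[a-z]+/gi);
// ["Sheridan"]

// Negative lookbehind: surnames that do not come after "Frank ".
const otherSurnames = collect(/(?<!frank )(?<= )[a-z]+/gi);
// ["Harrison", "Grayson", "O", "Harrison"]

// Variant that also allows apostrophes and hyphens in the surname.
const otherSurnamesFull = collect(/(?<!frank )(?<= )[a-z'-]+/gi);
// ["Harrison", "Grayson", "O'Keefe", "Harrison"]
```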

IMPORTANT: at the time of writing (March 2020), lookbehinds are unfortunately not supported by Firefox, Safari or IE, so check current browser support before relying on them.

Real world example #1: Validating an email address

We now know enough to be able to write a regex that matches email addresses. Note that there is no single universally accepted regex for matching email addresses but we can write one that will work in the vast majority of cases.

Let’s look at a few example email addresses and break the format down into parts.

joe@gmail.com

harry.smith@hotmail.co.uk

i-love-coding_1988@company-name.business

Recipient name:

This can be a combination of up to 64 letters, numbers, dots or special characters. So a simple solution would be to say ‘anything except a space’:

^[^\s]{1,64}

Next will simply be the @ symbol.

Domain name:

This can be a combination of up to 253 letters and numbers and can also contain a hyphen or dot (for sub-domains):

[a-z0-9-\.]{1,253}

TLD:

This can have one part or two parts (e.g. .com or .co.uk). If there are two parts, the first part will be captured by the above section of our regex (as it includes letters and dots) so we don’t need to worry about that.

So we just need to add a bit to the regex which will make sure that the string ends with ‘.uk’ or ‘.com’ or ‘.info’ etc. According to Google, the longest this can be is 24 characters and it can’t be shorter than 2. I’m also pretty sure it can only contain letters so we can write:

\.[a-z]{2,24}$

So the whole thing would look like this:

/^[^\s]{1,64}@[a-z0-9-\.]{1,253}\.[a-z]{2,24}$/gi

You can test this yourself on Regex101 by pasting in email addresses and checking that they match.
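
If you would rather check it in code, here is a minimal sketch. The isValidEmail function name is my own, and I have dropped the g flag because a regex with that flag remembers its position between calls to test, which is not what we want when validating one string at a time.

```javascript
// The regex from above, without the g flag.
const emailRegex = /^[^\s]{1,64}@[a-z0-9-\.]{1,253}\.[a-z]{2,24}$/i;

// Hypothetical helper: returns true if the string looks like an email address.
function isValidEmail(email) {
  return emailRegex.test(email);
}

const checks = {
  simple: isValidEmail("joe@gmail.com"), // true
  twoPartTld: isValidEmail("harry.smith@hotmail.co.uk"), // true
  longTld: isValidEmail("i-love-coding_1988@company-name.business"), // true
  noTld: isValidEmail("joe@gmail"), // false
  hasSpace: isValidEmail("joe smith@gmail.com"), // false
  noAtSymbol: isValidEmail("joegmail.com"), // false
};
```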

Real world example #2: Extracting sentences from a text

Another situation where you might need to use a regex is if you have a text and you need to extract the individual sentences.

There are a couple of possible ways to do it. One way is with the .split method and a lookbehind. Sentences are usually divided with either a full-stop, a question mark or an exclamation mark. So we could tell the split method to split the string on any space which comes directly after one of these punctuation marks:

const string = "My name is Tom. I live in London! Where do you live?";

const sentences = string.split(/(?<=[.!?])\s/gi);

This works fine but, as I noted earlier, lookbehinds are not supported by every browser. A more cross-browser compatible solution is to use the match method, which can look for a sequence of characters (other than these three punctuation marks) followed by one or more of them:

const sentences = string.match(/[^.!?]+[.!?]+/gi);
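
Here is a sketch comparing the two approaches side by side (the variable names are my own). One difference to be aware of: the match version keeps the space at the start of each sentence after the first, so I have added a trim step to tidy that up.

```javascript
const string = "My name is Tom. I live in London! Where do you live?";

// Split on any whitespace that comes directly after ., ! or ?
const viaSplit = string.split(/(?<=[.!?])\s/);
// ["My name is Tom.", "I live in London!", "Where do you live?"]

// Match runs of other characters followed by ., ! or ?
const viaMatch = string.match(/[^.!?]+[.!?]+/g);
// ["My name is Tom.", " I live in London!", " Where do you live?"]

// Remove the leading spaces left over by the match version.
const trimmed = viaMatch.map((sentence) => sentence.trim());
// ["My name is Tom.", "I live in London!", "Where do you live?"]
```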

This was just an overview of the main features of regular expressions.