A regular expression is a sequence of characters defining a search pattern. It is one of the most powerful tools for filtering and modifying text in bulk.

Regular expressions often scare newcomers, but despite the steep learning curve, regex will soon become one of your favorite tools.

The following is an example of a regular expression:

^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$

Regular expressions originated in 1951 and have their roots deep in mathematics. Regular languages were developed by mathematicians working in theoretical computer science, automata theory and formal languages.

In this tutorial, you will learn how to use the grep and sed tools in Linux to match, replace and delete text using regular expressions.

Hands On

To get started, you need a Linux machine. If you’re on Windows, WSL is enough for this tutorial.

As mentioned, we’ll be using the sed and grep commands in Linux. sed (Stream EDitor) is a non-interactive stream editor. The stream here refers to the Unix style of connecting the inputs and outputs of programs. sed is useful for modifying text on the fly or in bulk.

grep, on the other hand, searches plain-text data for lines that match a regular expression.
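Both tools read from standard input when no file is given, so a quick way to experiment is to pipe a few lines into them. The two sample lines below are made up for illustration; they are not taken from a real log.

```shell
# Two made-up sample lines, for illustration only.
sample='status unpacked demo-package
status installed demo-package'

# sed rewrites each line of the stream as it passes through.
sed_out=$(printf '%s\n' "$sample" | sed 's/status/STATE/')
printf '%s\n' "$sed_out"

# grep keeps only the lines that match the pattern.
grep_out=$(printf '%s\n' "$sample" | grep 'installed')
printf '%s\n' "$grep_out"
```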

The sed and grep tools come preinstalled on most Linux distributions and are ready to use.

Basic Regular Expression Operations

As an example, let’s have a look at the last 30 lines of /var/log/dpkg.log:

$ tail -n 30 /var/log/dpkg.log
2021-01-16 23:38:01 status unpacked libarchive-zip-perl:all 1.60-1ubuntu0.1
2021-01-16 23:38:01 status half-configured libarchive-zip-perl:all 1.60-1ubuntu0.1
2021-01-16 23:38:01 status installed libarchive-zip-perl:all 1.60-1ubuntu0.1
2021-01-16 23:38:01 configure libmime-charset-perl:all 1.012.2-1 <none>
2021-01-16 23:38:01 status unpacked libmime-charset-perl:all 1.012.2-1
2021-01-16 23:38:01 status half-configured libmime-charset-perl:all 1.012.2-1
2021-01-16 23:38:01 status installed libmime-charset-perl:all 1.012.2-1
2021-01-16 23:38:01 configure libimage-exiftool-perl:all 10.80-1 <none>
2021-01-16 23:38:01 status unpacked libimage-exiftool-perl:all 10.80-1
2021-01-16 23:38:01 status half-configured libimage-exiftool-perl:all 10.80-1
2021-01-16 23:38:01 status installed libimage-exiftool-perl:all 10.80-1
2021-01-16 23:38:01 trigproc man-db:amd64 2.8.3-2 <none>
2021-01-16 23:38:01 status half-configured man-db:amd64 2.8.3-2
2021-01-16 23:38:21 status installed man-db:amd64 2.8.3-2
2021-01-16 23:38:21 configure libsombok3:amd64 2.4.0-1 <none>
2021-01-16 23:38:21 status unpacked libsombok3:amd64 2.4.0-1
2021-01-16 23:38:21 status half-configured libsombok3:amd64 2.4.0-1
2021-01-16 23:38:21 status installed libsombok3:amd64 2.4.0-1
2021-01-16 23:38:21 status triggers-pending libc-bin:amd64 2.27-3ubuntu1
2021-01-16 23:38:21 configure libposix-strptime-perl:amd64 0.13-1build3 <none>
2021-01-16 23:38:21 status unpacked libposix-strptime-perl:amd64 0.13-1build3
2021-01-16 23:38:21 status half-configured libposix-strptime-perl:amd64 0.13-1build3
2021-01-16 23:38:21 status installed libposix-strptime-perl:amd64 0.13-1build3
2021-01-16 23:38:21 configure libunicode-linebreak-perl:amd64 0.0.20160702-1build2 <none>
2021-01-16 23:38:21 status unpacked libunicode-linebreak-perl:amd64 0.0.20160702-1build2
2021-01-16 23:38:21 status half-configured libunicode-linebreak-perl:amd64 0.0.20160702-1build2
2021-01-16 23:38:21 status installed libunicode-linebreak-perl:amd64 0.0.20160702-1build2
2021-01-16 23:38:21 trigproc libc-bin:amd64 2.27-3ubuntu1 <none>
2021-01-16 23:38:21 status half-configured libc-bin:amd64 2.27-3ubuntu1
2021-01-16 23:38:23 status installed libc-bin:amd64 2.27-3ubuntu1

Now let’s try to print only the lines where a package configuration has occurred:

$ tail -n 30 /var/log/dpkg.log |grep configure
2021-01-16 23:38:01 status half-configured libarchive-zip-perl:all 1.60-1ubuntu0.1
2021-01-16 23:38:01 configure libmime-charset-perl:all 1.012.2-1 <none>
2021-01-16 23:38:01 status half-configured libmime-charset-perl:all 1.012.2-1
2021-01-16 23:38:01 configure libimage-exiftool-perl:all 10.80-1 <none>
2021-01-16 23:38:01 status half-configured libimage-exiftool-perl:all 10.80-1
2021-01-16 23:38:01 status half-configured man-db:amd64 2.8.3-2
2021-01-16 23:38:21 configure libsombok3:amd64 2.4.0-1 <none>
2021-01-16 23:38:21 status half-configured libsombok3:amd64 2.4.0-1
2021-01-16 23:38:21 configure libposix-strptime-perl:amd64 0.13-1build3 <none>
2021-01-16 23:38:21 status half-configured libposix-strptime-perl:amd64 0.13-1build3
2021-01-16 23:38:21 configure libunicode-linebreak-perl:amd64 0.0.20160702-1build2 <none>
2021-01-16 23:38:21 status half-configured libunicode-linebreak-perl:amd64 0.0.20160702-1build2
2021-01-16 23:38:21 status half-configured libc-bin:amd64 2.27-3ubuntu1

As we can see, grep has listed all lines containing the word configure, even those with half-configured. Let’s fine-tune the search to keep only the lines containing ' configure' (note the leading space) and then print those that mention libsombok or libposix:

$ tail -n 30 /var/log/dpkg.log |grep ' configure' | grep -E 'libsombok|libposix'
2021-01-16 23:38:21 configure libsombok3:amd64 2.4.0-1 <none>
2021-01-16 23:38:21 configure libposix-strptime-perl:amd64 0.13-1build3 <none>

Notice the -E switch on the second grep: it tells grep to use extended regular expressions, in which the '|' character means alternation (a logical OR).

'libsombok|libposix' will match lines with either libsombok or libposix. You can combine any number of choices in the expression.
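The two grep calls above can also be folded into one, since parentheses limit how far the alternation reaches. Below is a minimal sketch using four lines copied from the log excerpt; the variable names are ours, and the -w (whole word) switch is assumed to be available, as it is in GNU and BSD grep.

```shell
# Four lines copied from the dpkg.log excerpt above.
log='2021-01-16 23:38:21 configure libsombok3:amd64 2.4.0-1 <none>
2021-01-16 23:38:21 status half-configured libsombok3:amd64 2.4.0-1
2021-01-16 23:38:21 configure libposix-strptime-perl:amd64 0.13-1build3 <none>
2021-01-16 23:38:21 configure libunicode-linebreak-perl:amd64 0.0.20160702-1build2 <none>'

# The leading space excludes "half-configured";
# the group (a|b) accepts either package prefix.
one_grep=$(printf '%s\n' "$log" | grep -E ' configure (libsombok|libposix)')
printf '%s\n' "$one_grep"

# Alternative to the leading space: -w matches "configure" only as a whole word.
whole_word=$(printf '%s\n' "$log" | grep -w 'configure')
printf '%s\n' "$whole_word"
```

The first command prints the libsombok3 and libposix-strptime-perl configure lines; the second prints all three configure lines and skips the half-configured one.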

Searching for plain-text words is quite easy, as you have seen. These are only the basics of regular expressions. Let’s go deeper with some more examples.

Detecting Digits

Let’s now delete every time formatted as HH:MM:SS with sed. The following is the output before deleting the times:

$ tail /var/log/dpkg.log
2021-01-16 23:38:21 status unpacked libposix-strptime-perl:amd64 0.13-1build3
2021-01-16 23:38:21 status half-configured libposix-strptime-perl:amd64 0.13-1build3
2021-01-16 23:38:21 status installed libposix-strptime-perl:amd64 0.13-1build3
2021-01-16 23:38:21 configure libunicode-linebreak-perl:amd64 0.0.20160702-1build2 <none>
2021-01-16 23:38:21 status unpacked libunicode-linebreak-perl:amd64 0.0.20160702-1build2
2021-01-16 23:38:21 status half-configured libunicode-linebreak-perl:amd64 0.0.20160702-1build2
2021-01-16 23:38:21 status installed libunicode-linebreak-perl:amd64 0.0.20160702-1build2
2021-01-16 23:38:21 trigproc libc-bin:amd64 2.27-3ubuntu1 <none>
2021-01-16 23:38:21 status half-configured libc-bin:amd64 2.27-3ubuntu1
2021-01-16 23:38:23 status installed libc-bin:amd64 2.27-3ubuntu1

Now let’s apply the sed command:

$ tail /var/log/dpkg.log | sed  's/[0-9][0-9]:[0-9][0-9]:[0-9][0-9]//'
2021-01-16  status unpacked libposix-strptime-perl:amd64 0.13-1build3
2021-01-16  status half-configured libposix-strptime-perl:amd64 0.13-1build3
2021-01-16  status installed libposix-strptime-perl:amd64 0.13-1build3
2021-01-16  configure libunicode-linebreak-perl:amd64 0.0.20160702-1build2 <none>
2021-01-16  status unpacked libunicode-linebreak-perl:amd64 0.0.20160702-1build2
2021-01-16  status half-configured libunicode-linebreak-perl:amd64 0.0.20160702-1build2
2021-01-16  status installed libunicode-linebreak-perl:amd64 0.0.20160702-1build2
2021-01-16  trigproc libc-bin:amd64 2.27-3ubuntu1 <none>
2021-01-16  status half-configured libc-bin:amd64 2.27-3ubuntu1
2021-01-16  status installed libc-bin:amd64 2.27-3ubuntu1

Looking at the regex [0-9][0-9]:[0-9][0-9]:[0-9][0-9], we can see that it is a search pattern for times formatted as XX:XX:XX, where each X is a digit. The s/ at the start stands for substitute, and the // at the end tells sed to replace the matching text with nothing, as there is nothing between the two slashes. The matched content is therefore deleted.
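The repeated [0-9] can be shortened with an interval such as {2}. The sketch below assumes a sed that accepts the -E switch for extended regex (GNU and BSD sed both do). The leading space inside the pattern is our addition, so that a single space is left behind instead of two.

```shell
# One line copied from the log excerpt above.
line='2021-01-16 23:38:23 status installed libc-bin:amd64 2.27-3ubuntu1'

# With -E, {2} means "exactly two of the previous item".
# The space before the time is part of the match and is deleted with it.
no_time=$(printf '%s\n' "$line" | sed -E 's/ [0-9]{2}:[0-9]{2}:[0-9]{2}//')
printf '%s\n' "$no_time"
```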

Next, we will modify the regex to make it match the date in YYYY-MM-DD format:

$ tail /var/log/dpkg.log | sed  's/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]//'
 23:38:21 status unpacked libposix-strptime-perl:amd64 0.13-1build3
 23:38:21 status half-configured libposix-strptime-perl:amd64 0.13-1build3
 23:38:21 status installed libposix-strptime-perl:amd64 0.13-1build3
 23:38:21 configure libunicode-linebreak-perl:amd64 0.0.20160702-1build2 <none>
 23:38:21 status unpacked libunicode-linebreak-perl:amd64 0.0.20160702-1build2
 23:38:21 status half-configured libunicode-linebreak-perl:amd64 0.0.20160702-1build2
 23:38:21 status installed libunicode-linebreak-perl:amd64 0.0.20160702-1build2
 23:38:21 trigproc libc-bin:amd64 2.27-3ubuntu1 <none>
 23:38:21 status half-configured libc-bin:amd64 2.27-3ubuntu1
 23:38:23 status installed libc-bin:amd64 2.27-3ubuntu1

Replacing Values

Instead of deleting the date, we will now just hide it by labelling it as confidential:

$ tail /var/log/dpkg.log | sed  's/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/CONFIDENTIAL/'
CONFIDENTIAL 23:38:21 status unpacked libposix-strptime-perl:amd64 0.13-1build3
CONFIDENTIAL 23:38:21 status half-configured libposix-strptime-perl:amd64 0.13-1build3
CONFIDENTIAL 23:38:21 status installed libposix-strptime-perl:amd64 0.13-1build3
CONFIDENTIAL 23:38:21 configure libunicode-linebreak-perl:amd64 0.0.20160702-1build2 <none>
CONFIDENTIAL 23:38:21 status unpacked libunicode-linebreak-perl:amd64 0.0.20160702-1build2
CONFIDENTIAL 23:38:21 status half-configured libunicode-linebreak-perl:amd64 0.0.20160702-1build2
CONFIDENTIAL 23:38:21 status installed libunicode-linebreak-perl:amd64 0.0.20160702-1build2
CONFIDENTIAL 23:38:21 trigproc libc-bin:amd64 2.27-3ubuntu1 <none>
CONFIDENTIAL 23:38:21 status half-configured libc-bin:amd64 2.27-3ubuntu1
CONFIDENTIAL 23:38:23 status installed libc-bin:amd64 2.27-3ubuntu1

This is very handy, especially when sharing logs with third parties for debugging.
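Several substitutions can be chained in one sed call with repeated -e switches. The sketch below masks both the date and the time; the DATE and TIME labels and the second sample string are made up for illustration, and -E is assumed to be supported as in GNU and BSD sed.

```shell
# One line copied from the log excerpt above.
line='2021-01-16 23:38:23 status installed libc-bin:amd64 2.27-3ubuntu1'

# Each -e adds one substitution; they run in order on every line.
masked=$(printf '%s\n' "$line" | sed -E \
  -e 's/[0-9]{4}-[0-9]{2}-[0-9]{2}/DATE/' \
  -e 's/[0-9]{2}:[0-9]{2}:[0-9]{2}/TIME/')
printf '%s\n' "$masked"

# By default only the first match on a line is replaced.
# The trailing g flag replaces every match (made-up sample string).
range='2021-01-16 to 2021-01-17'
all_masked=$(printf '%s\n' "$range" | sed -E 's/[0-9]{4}-[0-9]{2}-[0-9]{2}/DATE/g')
printf '%s\n' "$all_masked"
```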

Let’s do more examples.

Finding All Text Starting with < and Ending with >

.* is probably one of the most used patterns in regex: the dot matches any single character and the star repeats it zero or more times. In the example below, we use it to extract any text between < and >:

$ grep -o '<.*>' /var/log/dpkg.log | tail -n 5
<none>
<none>
<none>
<none>
<none>

The -o switch is used to print only the matched text. Otherwise, the whole line containing the matching text will be printed.
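One caveat: .* is greedy, so when a line holds several < > pairs it stretches from the first < to the last >. A common alternative is [^>]*, which stops at the first closing bracket. The sample text below is made up to show the difference.

```shell
# Made-up sample line with two bracketed words.
text='install <first> then <second>'

# Greedy: one match running from the first '<' to the last '>'.
greedy=$(printf '%s\n' "$text" | grep -o '<.*>')
printf '%s\n' "$greedy"

# Negated class: any characters except '>', so each pair matches on its own.
precise=$(printf '%s\n' "$text" | grep -o '<[^>]*>')
printf '%s\n' "$precise"
```

The first command prints a single line, <first> then <second>, while the second prints <first> and <second> on separate lines.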

Useful Sysadmin Filters

Here are a few interesting and useful regexes often used by sysadmins. Let’s create a test file containing an email and an IP address with the following command:

$ echo 'My email is example@domain.com and my IP is 127.0.0.1' > email
$ cat email
My email is example@domain.com and my IP is 127.0.0.1

Finding Emails

While extracting emails from text seems simple, writing a regex that matches every valid email address is quite complex:

$ grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" email
example@domain.com

Now that was a regular expression! Let’s break it down:

1. \b is a word boundary. Using it at the beginning and end of the expression makes the match start and end at the edge of a word.

2. [A-Za-z0-9._%+-]+@ is one or more alphanumeric characters, which may also include ., _, %, + or -, followed by an @.

3. [A-Za-z0-9.-]+\.[A-Za-z]{2,6} matches the domain name. Here the TLD (top-level domain) can be from 2 to 6 letters only. {} denotes a repetition count, and {2,6} means from 2 to 6 times.

Note that this regular expression is not a perfect match for all valid email addresses, but it will work in most cases.
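To see where the pattern succeeds and where it falls short, we can run it against a few candidates. The addresses below are made up for illustration, and \b is assumed to be supported as it is in GNU grep.

```shell
# The same pattern as above, stored in a variable for reuse.
email_re='\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'

# Three made-up lines: an ordinary address, one without a TLD,
# and one whose TLD is longer than six letters.
found=$(printf '%s\n' \
  'Contact alice.smith+news@mail.example.org today' \
  'bob@localhost has no top-level domain' \
  'carol@example.photography has an 11-letter TLD' |
  grep -E -o "$email_re")
printf '%s\n' "$found"
```

Only alice.smith+news@mail.example.org is printed. The third address is skipped because of the {2,6} limit, which is one of the cases where the pattern is too strict.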

Finding IPv4 Addresses

Following the same logic as the previous example, we can now match an IPv4 address:

$ grep -E -o  "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" email
127.0.0.1
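The pattern above checks only the shape of the address, so it also accepts out-of-range values such as 999.999.999.999. A stricter sketch limits each octet to 0-255. The sample addresses and the variable names octet and ipv4_re are ours, and the -x switch makes grep accept a line only when the whole line matches.

```shell
# The loose pattern from above matches an impossible address.
loose=$(printf '999.999.999.999\n' |
  grep -E -o '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}')
printf '%s\n' "$loose"

# One octet: 250-255, 200-249, 100-199, or 0-99.
octet='(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'
# A full address is one octet followed by three ".octet" groups.
ipv4_re="$octet(\.$octet){3}"

# -x keeps a candidate only if the entire line is a valid address.
strict=$(printf '%s\n' '127.0.0.1' '192.168.1.254' '999.999.999.999' '256.1.1.1' |
  grep -E -x "$ipv4_re")
printf '%s\n' "$strict"
```

The strict version keeps 127.0.0.1 and 192.168.1.254 and drops the two invalid candidates.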

Conclusion

POSIX is a standard that defines, among other things, named character classes for regular expressions. A class name is written inside a bracket expression. For example:

– Matching digits can be done with [[:digit:]] instead of [0-9].

– Matching lowercase letters can be done with [[:lower:]] instead of [a-z].

Depending on the tool you are using, you might need a specific switch to activate POSIX mode. Check out POSIX basic regular expression for more details.
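With grep, the POSIX classes work without any extra switch as long as they sit inside a bracket expression. A short sketch on one line copied from the log excerpt (the variable names are ours):

```shell
# One line copied from the log excerpt above.
line='2021-01-16 23:38:23 status installed libc-bin:amd64 2.27-3ubuntu1'

# [[:digit:]] is the class [:digit:] placed inside a bracket expression [...].
date_only=$(printf '%s\n' "$line" |
  grep -E -o '[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}')
printf '%s\n' "$date_only"

# [[:lower:]]+ matches runs of lowercase letters; keep only the first run.
first_word=$(printf '%s\n' "$line" | grep -E -o '[[:lower:]]+' | head -n 1)
printf '%s\n' "$first_word"
```

This prints 2021-01-16 followed by status.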

Regex will definitely be your Swiss Army knife when administering your servers. This tutorial has covered some basic and intermediate uses of regular expressions to extract and filter plain-text data in files. Check out the Regular expression page on Wikipedia to learn more about the possibilities of regex.