User-Level Intro to Linux

Module #07 - Regular Expressions

This module complements the previous ones devoted to filters by introducing regular expressions. We are going to get you started through multiple activities that provide the "big picture" as well as the more pragmatic "I need some tools" knowledge, and get you used to applying regular expressions to solve, as well as design, elementary problems.

Topics Studied in this Module

T1 - Regular Expressions
RegExps are available in various Linux command-line tools and in most programming languages. We are going to review their benefits from the perspective of filters, but this is definitely a topic you will encounter many times in your IT career.

Learning Activities

This module features the following learning activities; DF, PA, GQ

Linux+ Exams Objectives

This module deals with material related to the following objectives from the CompTIA Linux+ and LPI certification exams.

  • Exam LX0-101 - Objective 103.7 - Search text files using regular expressions

Let us start with a reading assignment. We are going to use one of Ryan Chadwick's Linux tutorials.

We are now going to supplement the above reading assignment with a few videos in order to provide a hands-on demonstration of the concepts that were covered.

  • grep review
  • Video Link: YouTube
  • Let us start with a review of the most basic way to use the grep tool.
  • Basic regular expressions
  • Video Link: YouTube
  • We introduce basic regular expressions and illustrate their usage with the grep tool. The video discusses various syntax to match single characters and neutralize characters with a special meaning in a regular expression. It then discusses the idea of anchors.
  • Matching character classes
  • Video Link: YouTube
  • This video provides an overview of the available character classes, which make it easier to describe the sets of characters you are trying to match. We are still focused on the grep tool and its basic regular expressions.
  • Repeating patterns & capturing groups
  • Video Link: YouTube
  • We add repeating patterns & capturing groups to the toolkit of basic regular expressions with the grep tool.
  • Extended Regular Expressions
  • Video Link: YouTube
  • We now switch to using the egrep tool and establish what extended regular expressions allow us to do.
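
As a quick recap of the last two videos, the sketch below runs the same search once in basic (BRE) and once in extended (ERE) syntax. The file name recap.txt and its contents are made up for illustration and do not come from the videos. It uses grep -E, which is the standard spelling of egrep; recent GNU grep releases print a warning when egrep is invoked directly.

```shell
# Hypothetical sample file, created only for this recap.
printf '%s\n' 'color' 'colour' 'colouur' 'colr' > recap.txt

# BRE: braces need backslashes to act as a repetition operator.
bre_hits=$(grep -c '^colou\{0,1\}r$' recap.txt)

# ERE: ? (zero or one occurrence) works without any backslash.
ere_hits=$(grep -E -c '^colou?r$' recap.txt)

# Both patterns accept "color" and "colour" only.
echo "BRE matched $bre_hits lines, ERE matched $ere_hits lines"
```

Both commands describe the same set of lines; only the notation for the repetition differs.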

Please note that it is highly advisable for you to work on this DF assignment before you attempt the PA.

Topics - Review Regular Expressions Crafting Tools

Regular expressions are hard to both design and read. The syntax is often a big part of the difficulty in learning to use them. This is why most programmers, or system administrators, use software tools specifically designed to "simulate" how a regular expression will match some specific text. The matches are usually highlighted to help even further.
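
Before you install a dedicated tool, grep itself can give a rough version of this feedback. The sketch below assumes GNU grep, as found on most Linux distributions, whose -o option prints only the text that matched rather than the whole line. The sample sentence is made up for illustration.

```shell
# Made-up sample line; -o prints each matched substring on its own line.
matches=$(printf '%s\n' 'order 66 shipped on day 7 of week 12' | grep -E -o '[0-9]+')

# Prints 66, 7 and 12, one per line.
echo "$matches"
```

This shows what the regular expression consumed, which is the same information a graphical tool conveys with highlighting.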

In order to make the other learning activities in this module a bit more pleasant, your first order of business will be to search for a free or open source version of such a tool, install it on Linux or Windows, and use it as you work on this module's PA assignment.

You will find a list of links to such tools below. This list is by no means comprehensive. You are encouraged to search around for better tools, but you will have at least some examples of what you should be looking for.

Requirements for your Posts

As you download and install one of these tools, post a new thread in this module's DF forum, with its name as title. Make sure you provide the URL you used to download it. You are expected to post your reviews about the tool you used in its respective thread. Focus on

  • Ease of installation, availability on multiple platforms, GUI usability
  • Family of regex supported by the tool; e.g., POSIX Basic (BRE), POSIX Extended (ERE), Perl Compatible (PCRE)
  • How they are suited to what we study in the lectures
  • How they are suited to what we need for the PA

The key here is to provide details; e.g. post

  This tool is not good because it doesn’t support ERE,
  therefore it is usable only with grep exercises but I
  was unable to do anything with it when working on the
  quiz question about the phone numbers because it
  required an egrep solution

Instead of…

  This tool is not good

Focus your efforts on researching new tools and on making your posts relevant to evaluating their usefulness.

Examples of Regular Expressions Tools

This PA will provide you with a series of mini-exercises focused on designing a regular expression. The regular expression will have to match an explicitly defined pattern. In other exercises, the pattern will be implicitly defined with the help of two series of examples: things to match vs. things not to match.

Exercise #1 - Grab bag

Here is a text file with a series of sentences, one per line;

  This is a good idea 
  What about this?
  this goes here...
  ...here goes this
  Middle of this...
  In java this.methodname works fine

As you see, the word "this" appears on every single line. Paste the above text into a text file in your Linux virtual machine. Take a look at the following regular expressions;

  This
  [Tt]his
  this 
  this$
  ^this 
  ^this$
  \bthis\b
  \<this\>

Your task is to take each of these regular expressions, one by one, and identify which of the above lines they will match if used with egrep.

Once you have identified the lines a given regular expression matches, try it out in your Linux virtual machine to verify that you were right. For each error you made, take the time to identify exactly why the regular expression did or did not match. Make sure that you use our forums to ask about anything you are not able to explain. Do not settle for a vague approximation of an explanation. This kind of practice only works if you make a proper effort to learn from every mistake and seek help in doing so.

For each regex, submit the regex along with a list of the sentence(s) that it matches. You will also briefly explain why it did or did not match in each case. Please note that you do not have to submit any shell output from your testing of the regex.

Exercise #2 - Phone Numbers

Is it possible to talk about regular expressions without bringing up examples involving phone numbers? Yes. However, we will still use this traditional scenario in this practice exercise :)

We want to write a regular expression which will only match what we define to be valid phone numbers. While the digits themselves are irrelevant, each being between 0 and 9, it is their grouping and the various symbols used to do the grouping that are of interest to us.

Take a look at the following list of valid phone numbers to get an idea of the syntax we accept. All of the following should be matched by your regular expression;

  555-667-7088
  555 667 7088
  (555)667-7088
  (555) 667-7088
  (555) 667 7088
  555    667-7088

Please note that the last line uses a single tab to separate 555 from 667. The general rule is that we accept a single space, or a single tab, to separate groups of digits.

To help you understand the syntax we are accepting, let us take a look at some examples of badly structured phone numbers. None of these should be matched by your regular expression;

  (555 667-7088
  (555)-667-7088
  555  667-7088

Mismatched parentheses are definitely not acceptable. Neither is a dash right after a closing parenthesis. The last line looks similar, on this webpage, to the last line of our valid examples list above. It is not. Instead of a single tab, we used multiple spaces here. Multiple spaces or tabs are not allowed to separate the groups of digits.

Paste both lists in two separate text files respectively named positive.txt and negative.txt. You will then design your regular expression and test it with egrep. You need to be able to match all lines in the positive.txt file, and none of the lines in the negative.txt one.
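
The same two-file testing approach can be sketched on an unrelated pattern. Everything in the example below is made up for illustration: the file names dates_ok.txt and dates_bad.txt, their contents, and the date pattern itself. It assumes only the standard grep options -E, -c (count matching lines) and -v (invert the match).

```shell
# Hypothetical example files (dates, not phone numbers).
printf '%s\n' '2024-01-31' '1999-12-01' > dates_ok.txt
printf '%s\n' '2024-1-31' '24-01-31' '2024/01/31' > dates_bad.txt

# Illustrative pattern: four digits, dash, two digits, dash, two digits.
pattern='^[0-9]{4}-[0-9]{2}-[0-9]{2}$'

# grep exits with status 1 when nothing matches, so "|| true" keeps
# scripts that run under "set -e" from stopping here.
# Valid lines the pattern fails to match (should be 0).
missed=$(grep -E -v -c "$pattern" dates_ok.txt || true)
# Invalid lines the pattern wrongly matches (should be 0).
false_hits=$(grep -E -c "$pattern" dates_bad.txt || true)

echo "valid lines missed: $missed, invalid lines matched: $false_hits"
```

Seeing two zeros only tells you the pattern behaves on the examples you wrote down, which is why adding more examples of both kinds is worthwhile.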

You are invited to add both positive and negative examples to better illustrate the definition of valid phone numbers.

Submit a screenshot of your command, as well as the output from when you test it on both files. Make sure that your regex correctly matches all the positive examples and none of the negative examples.

As usual, make sure your full name appears in the terminal for your screenshots.

Exercise #3 - URLs

We want to design a regular expression which allows us to match with egrep all the lines of a text file which contain a URL. We define what a URL is explicitly as follows;

A URL starts with the keyword http or https, written in either all upper or all lower case. This is then followed by :// and a hostname, itself followed by a /. After the hostname, we have an optional pathname with a filename at the end.

We will assume the following;

  • The hostname ends with a valid domain name; i.e. com, org, edu, or info
  • Before that, it is made of one or more alphanumerical names, each followed by a single dot
  • A pathname is made of one or more folder names, each separated by /
  • A folder name is made of one or more alphanumerical characters. Any alphabetical part of the folder name might be in upper or lower case
  • A filename is made of one or more alphanumerical characters, followed by a single dot, followed by exactly 3 alphabetical characters. Filenames might be in upper or lower cases

Submit the command you designed; it can be in a screenshot or you can simply write it down in your pdf. Make sure that you test your regex thoroughly using a number of tests. In order for you to receive full credit, your regex must satisfy all of the above-specified conditions. The grader will use a list of tests designed to see how robust your regex is in handling edge cases. Here are some examples of what to look out for;

  • A URL which starts with a mix of upper and lower case letters, like hTTpS.
  • A valid URL beginning followed by :/ instead of ://
  • A URL without a valid domain name at the end, or a domain name with non-alphanumeric characters like !com&
  • A pathname where folders are separated by \ rather than /

This is not an exhaustive list, but it should give you an idea of some of the things you will want to test before you submit your answer.

Exercise #4 - Java Comments

There are two forms of comments in Java.

  • One-line comments start anywhere in a given line with // and end at the end of the line
  • Multi-line comments start anywhere in a given line with /* and end anywhere in a following line with a */

However, sometimes, /*...*/ are used as one line comments; e.g.

  int data = 42; /* this is the magic number */

Regardless of whether the */ is right at the end of the line or followed by a few spaces or tabs, we might have used a // comment instead in such situations.

Design a regular expression to allow us to use egrep to extract all lines from an arbitrary Java file which are using such single-line /*...*/ comments. HINT - actively try to find counterexamples to your solution and discuss them on the forums with other students.

When you are sure your regular expression works, you may then pipe the result to another tool which will allow us to determine how many such lines were in the Java file. We will charge the programmers in charge of developing that file 1 cookie per such line.

Submit a screenshot showing both the regex you designed as well as the command you used to count how many such lines occur in a java file. For this task, you can assume that the java file has no syntax errors and can compile and run without issue. Make sure that you test your command on some actual java code. You will be graded based on both the correctness of your regex and on your command’s ability to successfully count the correct number of single line /* */ comments.

This tab provides you with optional resources meant for those who want to learn a bit more about the topics covered in this module.

Harley Hahn's Guide to UNIX and Linux

If you are interested in learning more about regular expressions, I recommend you look for Harley Hahn's Guide to UNIX and Linux. It is unfortunately out of print but is a great (and entertaining) read about the topic.

The following sections of the above are particularly relevant to this module;

  • Section 20: Regular Expressions