Regular Expression: Learn, Understand & Create Your Own

What is a Regular Expression?
A regular expression is a pattern that describes a set of strings. It can be used to search for, and manipulate, matching text within an input string.
A regex (short for regular expression) has a special syntax that is understood by a regex engine.
The best part is that it is supported by most programming languages and platforms, such as .NET, PHP, JavaScript, Ruby and Perl. They all have small variations, which depend on their regex engine implementation and library support.

History (you can skip)
The American mathematician Stephen Cole Kleene introduced regular expressions in 1956 as a mathematical notation for what he called regular sets. He is known as the father of regular expressions.
In 1968, Ken Thompson, an American pioneer of computer science, was the first to implement a regular expression search algorithm, for his text editor, based on that notation.

About ‘Regex Engine’

  1. In programming, an engine is a piece of software that processes some inputs and produces some outputs. Similarly, a regular expression engine is a piece of software that a programming language uses to process regular expressions. It takes a text string and a pattern as inputs and produces output based on pattern matching.
  2. Different programming languages have their own engines, which are not fully compatible with each other. Pattern syntax varies somewhat between them, but for the common features covered here the output is the same as long as the expression follows the syntax rules of the engine in use.
  3. In one sentence, different regex engines are very similar but not identical.
  4. The open source PCRE (Perl Compatible Regular Expressions) engine is used by PHP and many other tools. .NET has its own regular expression library. Java ships its regular expression package (java.util.regex) with the JDK; FREJ (Fuzzy Regular Expressions for Java) is a separate third-party library. JavaScript has a built-in RegExp object, and XRegExp is a third-party library that extends it.
  5. Our concern here is not to understand the engine itself but to understand the patterns we create for it and how it processes them.
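As a rough illustration of the "pattern and text in, matches out" idea, here is a minimal sketch using Python's built-in re module. The pattern and the sample text are made up for this example; other engines expose similar search functions under different names.

```python
import re

# Hypothetical inputs: a pattern and the text to search in.
pattern = r"\d+"
text = "Order 66 shipped in 3 days"

# re.search returns the first match (or None if nothing matches).
match = re.search(pattern, text)
if match:
    print(match.group(), match.span())

# re.findall returns every non-overlapping match as a list of strings.
all_matches = re.findall(pattern, text)
print(all_matches)
```

The engine does the scanning; the only thing we supply is the pattern.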

What Does a Regular Expression Do?
Before learning anything, we should know the answer to the “why” question.
So, here is the question: why regular expressions? What do they do? What are their benefits?
The answer is that they perform the following operations:

  1. Matching/Finding a string pattern.
  2. Find a string pattern and replace it in the input text.
  3. Validating input data.
  4. Extracting data from HTML or XML text (for complete documents, a dedicated parser is more reliable).

To do all these operations, we only need a regex pattern.

Here, we will learn how to create a regex pattern and understand how it validates user input data.

As you know, most projects have forms that accept user input through input fields. Each input field is designed to accept specific data. For example, if a form has an input field for a name, it should accept only letters, not digits or special characters. A mobile number field should accept digits, optionally with some special characters such as parentheses or hyphens (9953933079, 995-393-3079, 995(393)3079 and so on), but not letters or other special characters. An email field should accept only the pattern of an email address. Similarly, a PIN code, a URL, a password, a user ID/username, an amount and a date each have a specific pattern.

All of the above can be validated with a regular expression, often in a single line of code.
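To make the mobile number case concrete, here is a minimal sketch in Python. The accepted format (10 digits, optionally written with hyphens or with parentheses around the middle three digits) is an assumption based on the three sample numbers above, and the function name is_valid_mobile is hypothetical. Real phone formats vary by country.

```python
import re

# Assumed format: 9953933079, 995-393-3079 or 995(393)3079.
MOBILE = re.compile(r"\d{3}(-\d{3}-|\(\d{3}\)|\d{3})\d{4}")

def is_valid_mobile(value: str) -> bool:
    """Return True only if the whole value fits the assumed mobile format."""
    return MOBILE.fullmatch(value) is not None

samples = ["9953933079", "995-393-3079", "995(393)3079", "99539-33079", "995393307a"]
for candidate in samples:
    print(candidate, is_valid_mobile(candidate))
```

fullmatch is used instead of search so that extra characters before or after the number cause the validation to fail.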

Understand Regular Expression:
As stated earlier, a regex has a special syntax that the regex engine must be able to understand. If the pattern is not written correctly, the engine will report an error, produce the wrong output or produce no output.

Some characters, called metacharacters, have a special meaning in regular expressions.

  1. Metacharacters: The following are the metacharacters of regular expressions:
    1. Square brackets, curly brackets, round brackets, backslash, caret, dollar sign, period (dot), vertical bar (pipe), question mark, asterisk (star), plus sign and hyphen.
    2. [ ], { }, ( ), \, ^, $, ., |, ?, *, +, -
    3. I will explain each metacharacter, its meaning and its uses.
    4. For now, here is brief information about each one.
    5. Square brackets: define a character class, i.e. a set of characters. E.g. “[aeiou]” will match any one of a, e, i, o, u.
    6. Curly brackets: define how many times a single character or group of characters may occur. E.g. “a{2,5}” will match “aa”, “aaa”, “aaaa”, “aaaaa”.
    7. Round brackets: define a group of characters. E.g. in “[aeiou](a{2,5})”, the part “a{2,5}” forms a group.
    8. Backslash: escapes a metacharacter in the regex pattern. E.g. “a\{” will match “a{”.
    9. Caret: matches at the start of a line. Also used for negation inside square brackets. E.g. “^(An)” will match in “Anand” and “And” but not in “Apple”.
    10. Dollar: matches at the end of a line. E.g. “(and)$” will match in “Anand” and “and” but not in “Android”.
    11. Period or dot: matches any character except a new line. E.g. “a.b” will match “axb”, “ayb”, “azb”, “a.b”.
    12. Pipe operator: works as an OR operator, matching either the first or the second alternative. E.g. “Raj|Anand” will match both “Raj” and “Anand”.
    13. Question mark: makes a character or group of characters optional. E.g. “Get(Values)?” will match “Get” and “GetValues”.
    14. Asterisk: matches zero or more occurrences of a character or group. E.g. “and*” will match “an”, “and”, “andd”, “anddd” etc.
    15. Plus sign: matches one or more occurrences of a character or group. E.g. “and+” will match “and”, “andd”, “anddd” etc.
    16. Hyphen: used inside square brackets to describe a range of characters. E.g. “[0-9]” will match any digit from 0 to 9.
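The short examples in this overview can be checked mechanically. The following is a small sketch in Python; the table of cases reuses the sample patterns and strings from the list, plus one extra negative case for “and+” that I added for contrast.

```python
import re

# (pattern, sample text, whether a match is expected)
cases = [
    (r"a{2,5}", "aaa", True),
    (r"a\{", "a{", True),
    (r"^(An)", "Anand", True),
    (r"^(An)", "Apple", False),
    (r"(and)$", "Anand", True),
    (r"(and)$", "Android", False),
    (r"a.b", "axb", True),
    (r"Get(Values)?", "Get", True),
    (r"and*", "an", True),
    (r"and+", "an", False),  # added case: + needs at least one "d"
]

# Compare what the engine reports with what the overview claims.
results = [bool(re.search(p, s)) == expected for p, s, expected in cases]
print(all(results))
```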
  2. Literal Characters: Any character in the regex other than the metacharacters.

    Example 1:
    Regex: “and”
    Input: “Hey guys, I am Anand and I have an Android phone.”
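    Match result: 2 matches of “and” (inside “Anand” and the word “and”; “Android” is not matched because matching is case-sensitive)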
    Literal chars: and
    No meta chars.

    Example 2:
    Regex: “^(and)”
    Regex Explanation: match a string that starts with “and”
    Input1: “Hey guys, I am Anand and I have an Android phone.”
    Input2: “android phones are good.”
    Match result1: no match.  Match result2: 1 match
    Literal chars: and
    Meta chars: ^, ()
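Examples 1 and 2 can be reproduced with a few lines of Python. This is only a sketch using the standard re module; other engines should give the same results for these simple patterns.

```python
import re

sentence = "Hey guys, I am Anand and I have an Android phone."

# Example 1: a literal pattern is case-sensitive, so "Android" does not count.
literal_hits = re.findall(r"and", sentence)
print(len(literal_hits))

# Example 2: ^ anchors the pattern to the start of the input.
anchored_miss = re.search(r"^(and)", sentence)
anchored_hit = re.search(r"^(and)", "android phones are good.")
print(anchored_miss, anchored_hit.group())
```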

  3. Single Character Match Patterns: The following regex patterns each match a single character of the text string.

      .

    Period or dot: match anything except line breaks.

      \d

    Match a digit in 0123456789

      \D

    Match a non-digit character.

      \w

    Match a “word”: letters, digits & underscore _

      \W

    Match a non-word character

      \t

    Match a tab

      \r

    Match a carriage return

      \n

    Match a new line (line feed)

      \s

    Match a whitespace character: space, \t, \r, \n

      \S

    Match a non-whitespace character


    Example 3:
    Regex: “\d”, Input: “I have Rs. 500”
    Match result: 5,0,0

    Example 4:
    Regex: “\w”, Input: “I have Rs. 500”
    Match result: I,h,a,v,e,R,s,5,0,0

    Example 5:
    Regex: “\W”, Input: “I have Rs. 500”
    Match result: the 3 spaces between the words and the one period (.)
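Examples 3 to 5 can be verified with the sketch below, written in Python. It only uses re.findall, which returns every single-character match in order.

```python
import re

text = "I have Rs. 500"

digits = re.findall(r"\d", text)          # Example 3
word_chars = re.findall(r"\w", text)      # Example 4
non_word_chars = re.findall(r"\W", text)  # Example 5

print(digits)
print(word_chars)
print(non_word_chars)
```

Every character of the input is either a word character or a non-word character, so the last two lists together account for the whole string.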

  4. Character Classes: Square brackets are used to describe a character class, which is a set of characters inside the brackets. The regex matches exactly one character from that set.

    Example 6:
    Regex: “[aeiou]”
    Input: “Find vowels in this sentence.”
    Matches: i,o,e,i,i,e,e,e 

    Negation: if you place a caret right after the opening bracket, the class means ‘any character not in this set’.
    Thus, the regex pattern “[^aeiou]” will match every character in the input above except the vowels i,o,e,i,i,e,e,e (spaces and the period are matched too).
    Note: [a^eiou] does not mean ‘a, or anything other than e, i, o, u’. A caret that is not the first character in the class is treated as a literal ^.
    Range Chars: We can define a range of characters inside the brackets to match any one character in that range.
    For example: [0-9] will match any digit from 0 to 9 (0,1,2,3,4,5,6,7,8,9)
    Similarly, [0-9a-zA-Z] matches any digit, any lower-case letter or any upper-case letter.
    Note: the hyphen (-) has a special meaning inside the brackets but is a literal character outside them. If you want to include a hyphen in the character class, escape it with a backslash, as in [0-9\-]
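The character class rules above (plain class, negation, a literal caret and an escaped hyphen) can be tried out with this small Python sketch. The short test strings “a^b” and “995-393” are made up for illustration.

```python
import re

sentence = "Find vowels in this sentence."

vowels = re.findall(r"[aeiou]", sentence)       # Example 6
non_vowels = re.findall(r"[^aeiou]", sentence)  # negated class

# A caret that is not first in the class is a literal character.
caret_literal = re.findall(r"[a^eiou]", "a^b")

# An escaped hyphen is a literal; 0-9 is still a range.
digits_or_hyphen = re.findall(r"[0-9\-]", "995-393")

print(vowels)
print(len(non_vowels))
print(caret_literal)
print(digits_or_hyphen)
```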

  5. Boundaries: Boundary characters help anchor the pattern to some edge; they do not consume any characters themselves.

      \b

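    Word boundary: defined as any edge between a \w and a \W character (the start and end of the input also count when a \w character sits there)

      \B

    Non-word boundary: any position that is not a word boundary

      ^

    The beginning of the line (in most engines, the beginning of the whole input unless multi-line mode is on)

      $

    The end of the line (in most engines, the end of the whole input unless multi-line mode is on)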


    Example 7:
    Regex: “\band\b”
    Input: “Hey guys, I am Anand and I have an Android phone.”
    Match result: single “and”

    Example 8:
    Regex: “\Band”
    Match result on the above input: only the “and” in “Anand”

    Here the caret and dollar sign are very important. Suppose I must match a ‘user_name’ data field.
    A ‘user_name’ can be alphanumeric but cannot start with a digit.

    Example 9:
    Regex: “^[a-zA-Z_]\w*”
    Inputs: anand420, 420anand
    Match: anand420

    Example 10:
    Regex (additionally requires the last character to be a digit): “^[a-zA-Z_](\w)*(\d)$”
    Input: anand420, anand_420_kumar, _anand420, anand_420
    Match: anand420, _anand420, anand_420
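Examples 7, 8 and 10 can be checked with the Python sketch below. The extra candidate “420anand” is taken from Example 9 to show that a leading digit is still rejected.

```python
import re

sentence = "Hey guys, I am Anand and I have an Android phone."

# Example 7: \b on both sides keeps only the standalone word.
whole_word = [m.span() for m in re.finditer(r"\band\b", sentence)]

# Example 8: \B requires the match to start inside a word.
inside_word = [m.span() for m in re.finditer(r"\Band", sentence)]

# Example 10: starts with a letter or underscore and ends with a digit.
username = re.compile(r"^[a-zA-Z_](\w)*(\d)$")
candidates = ["anand420", "anand_420_kumar", "_anand420", "anand_420", "420anand"]
accepted = [name for name in candidates if username.search(name)]

print(whole_word, inside_word)
print(accepted)
```

The spans show where each match sits in the sentence, which makes it easy to see that the two patterns pick different occurrences of “and”.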

  6. Repetition Quantifiers: We can tell the regex engine to match repetitions of a single character or of a group of characters.

      X*

    Zero or more repetitions of X

      X+

    1 or more repetitions of X

      X?

    Zero or 1 occurrence of X

      X{n}

    Exactly n repetitions of X

      X{n,}

    At least n repetitions of X

      X{n,m}

    At least n and at most m repetitions of X


    Note: by default, a quantifier applies to the single character before it, but we can apply it to a group of characters using round brackets (…)

    e.g. “ab+” matches “ab”, “abb”, “abbb”, but “(ab)+” matches “ab”, “abab”, “ababab”
  7. Alternation: The pipe operator provides an alternative match. If you want to match “men” or “women”, you can write “men|women”.
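Here is a short Python sketch of the quantifier and alternation rules. re.fullmatch is used so that the whole sample string has to fit the pattern; the sample strings other than those quoted in the text are made up.

```python
import re

def fits(pattern: str, text: str) -> bool:
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Quantifier on one character versus on a group.
single_char = [fits(r"ab+", "abbb"), fits(r"ab+", "abab")]
grouped = fits(r"(ab)+", "abab")

# {n}, {n,} and {n,m}
exact = fits(r"\d{4}", "2024")
at_least = fits(r"\d{2,}", "7")
bounded = fits(r"a{2,5}", "aaaaaa")

# Alternation: either alternative may match.
people = re.findall(r"men|women", "men and women")

print(single_char, grouped, exact, at_least, bounded, people)
```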
  8. Backreferences: A matched set of characters can be referenced later in the same regex. Put the part you want to reuse inside round brackets, then number the groups by counting opening round brackets from the left. Later in the expression, the text matched by a group can be referenced by its number as \n, where n refers to the nth opening bracket.

    e.g. to match a repeated word: “\b(\w+) \1\b” will match “the the”, “and and”

    Note: Remembering part of the match for a backreference adds some overhead for the regex engine, and by default round brackets always capture. If you do not use the backreference anywhere, you can tell the engine not to capture the group by using non-capturing parentheses, written (?:…).
    Suppose I have the regex “\b(\w+)\b”. The engine will automatically capture the part matched by “\w+” because it is inside round brackets, whether or not you use that reference. If you do not need it, tell the engine not to capture it: “\b(?:\w+)\b”. As a consequence, “\b(?:\w+) \1\b” is an invalid expression, because group 1 no longer exists.

    The regex engine matches and stores the part of the input matched by the expression inside round brackets.
    e.g. the regex “(.)(.)(.)” will match “abc” and capture “a”, “b”, “c” for backreferences, but the regex “(?:.)(?:.)(?:.)” will match “abc” without capturing any part.
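The difference between capturing and non-capturing groups can be observed directly. The following Python sketch uses a made-up sentence for the repeated-word search; the error check at the end relies on Python raising re.error for a reference to a group that does not exist, and other engines may report this differently.

```python
import re

# \1 refers back to whatever the first group captured.
repeated = re.search(r"\b(\w+) \1\b", "This is the the example")
print(repeated.group(), repeated.group(1))

# Capturing groups store their text; non-capturing groups do not.
captured = re.match(r"(.)(.)(.)", "abc").groups()
uncaptured = re.match(r"(?:.)(?:.)(?:.)", "abc").groups()
print(captured, uncaptured)

# With a non-capturing group there is no group 1 to refer back to.
try:
    re.compile(r"\b(?:\w+) \1\b")
    backreference_ok = True
except re.error:
    backreference_ok = False
print(backreference_ok)
```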

Important Key Points:

  1. If you want to match one of the above metacharacters literally, put a backslash in front of it. For example, to match a dollar sign I must write \$, and to match a backslash I must write a double backslash \\. Here the backslash removes the special meaning of the next character.
  2. Inside character classes, only the closing square bracket, backslash, caret and hyphen have a special meaning. All other characters are treated as literals inside square brackets. E.g. [+*] matches a + sign or a * sign.
  3. Suppose I must find additions or multiplications of digits in a text string.
    Something like: first digit + second digit, first digit * second digit
    For addition, the expression must find a digit, then a plus sign and then a digit again: “\d(\+)\d”. To accept either operator, use a character class: “\d([+*])\d”.
  4. In many engines, a single curly bracket has no special meaning; curly brackets become special only when used for repetition, as in {1,5}. In those engines you do not need a backslash to match a single { or }. Some engines, such as Java's, reject an unescaped {, so check the engine you use.
  5. Special escape sequence: \Q…\E or only \Q… (supported by Perl, PCRE and Java, among others, but not by every engine)
    Anything written within this escape sequence no longer has a special meaning in the regular expression. Suppose I have to find the literal text “[0-9]{1,}(\+)[0-9]{1,}” in a string; then the expression will be “\Q[0-9]{1,}(\+)[0-9]{1,}\E”
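The escaping rules above can be sketched in Python as follows. Note one substitution: Python's re module does not support \Q…\E, so the sketch uses re.escape() as a stand-in that produces the same effect by escaping every special character. The sample strings are made up.

```python
import re

text = "2+3 and 4*5"

# Key point 3: \+ is a literal plus sign.
sums = re.findall(r"\d\+\d", text)

# Key point 2: inside a class, + and * need no backslash.
operations = re.findall(r"\d[+*]\d", text)

# Key point 5: searching for a regex as literal text.
# re.escape() stands in for \Q...\E, which Python does not support.
literal = r"[0-9]{1,}(\+)[0-9]{1,}"
haystack = "pattern: " + literal + " ends here"
found = re.search(re.escape(literal), haystack)

print(sums, operations)
print(found.group() == literal)
```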

NOTE: you can test your regex with online tools such as:

  1. https://regex101.com/
  2. http://www.regexplanet.com/

What is next?

  1. Non-capturing parentheses.
  2. Look-ahead.
  3. Look-behind.