WHAT IS A REGULAR EXPRESSION IN COMPUTATION THEORY?
To work with formal languages and string patterns, it is essential to understand regular expressions, regular grammar, and regular languages. These concepts form the foundation of automata theory, compiler design, and text processing.

DEFINITION:
- Regular expressions are symbolic notations used to define search patterns in strings. They describe regular languages and are commonly used in tasks such as validation, searching, and parsing.
- A regular expression is basically a shorthand way of showing how a regular language is built from the base set of regular languages.
- A regular expression represents a regular language if it follows these rules:
- ϵ (epsilon) is a regular expression representing the language {ϵ} (the empty string).
- Any symbol ‘a’ from the input alphabet Σ is a regular expression, representing the language {a}.
- Union (a + b) is a regular expression if a and b are regular expressions, representing the union of their languages; for single symbols a and b this is the language {a, b}.
- Concatenation (ab) is a regular expression if a and b are regular expressions.
- Kleene star (a*) is a regular expression if a is a regular expression, meaning zero or more occurrences of ‘a’, forming a regular language.
Description | Regular Expression | Regular Language |
---|---|---|
Set of vowels | (a + e + i + o + u) | {a, e, i, o, u} |
‘a’ followed by 0 or more ‘b’ | ab* | {a, ab, abb, abbb, abbbb, ...} |
Any number of vowels followed by any number of consonants | [aeiou]* [bcdfghjklmnpqrstvwxyz]* | {ε, a, aou, aiou, b, abcd, ...} (ε represents the empty string) |
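To connect the formal notation with everyday tooling, here is a minimal Python sketch (an illustration, not part of the original text) that tests the three patterns from the table with the standard re module; note that practical regex syntax writes union as | or a character class rather than +.

```python
import re

# Patterns from the table above, written in practical regex syntax.
vowel = re.compile(r"[aeiou]")                                   # set of vowels: a + e + i + o + u
a_then_bs = re.compile(r"ab*")                                   # 'a' followed by zero or more 'b'
vowels_then_consonants = re.compile(r"[aeiou]*[bcdfghjklmnpqrstvwxyz]*")

# fullmatch succeeds only if the whole string belongs to the described language.
print(bool(vowel.fullmatch("e")))                                # True
print(bool(a_then_bs.fullmatch("abbb")))                         # True
print(bool(a_then_bs.fullmatch("ba")))                           # False
print(bool(vowels_then_consonants.fullmatch("aioubcd")))         # True
print(bool(vowels_then_consonants.fullmatch("")))                # True, since ε is in the language
```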
REGULAR GRAMMAR
- A regular grammar is a formal grammar that generates regular languages. It consists of:
- Terminals: Symbols that form strings (e.g., a, b).
- Non-terminals: Variables used to derive strings (e.g., S, A).
- Production Rules: Rules for transforming non-terminals into terminals or other non-terminals.
- Start Symbol: The non-terminal from which derivations begin.
- A regular grammar generates a regular language. Each production has a single non-terminal on the left-hand side and a right-hand side consisting of a string of terminals, optionally followed (right-linear) or preceded (left-linear) by a single non-terminal.
The productions must be in the form:
A ⇢ xB
A ⇢ x
A ⇢ Bx
where A, B ∈ V (the set of variables) and x ∈ T*, i.e., a string of terminals. A single grammar uses either the right-linear form (A ⇢ xB) or the left-linear form (A ⇢ Bx) throughout, not a mix of both.
- Types of regular grammar:
- Left linear grammar (LLG)
- Right linear grammar (RLG)
- A grammar is an LLG if all the productions are of the form
A ⇢ Bx
A ⇢ x
where A,B ∈ V and x ∈ T*
- A grammar is an RLG if all the productions are of the form
A ⇢ xB
A ⇢ x
where A,B ∈ V and x ∈ T*
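To make the right-linear form concrete, the following sketch enumerates short strings derivable from an RLG; the dictionary encoding and the example grammar (which generates a*b) are illustrative assumptions, not taken from the text.

```python
from collections import deque

# Right-linear grammar: each production A -> xB is stored as (x, B),
# and A -> x as (x, None). This illustrative grammar generates a*b.
productions = {
    "S": [("a", "S"), ("b", None)],
}

def generate(start="S", max_len=4):
    """Enumerate all terminal strings of length <= max_len derivable from start."""
    results = set()
    queue = deque([("", start)])              # (terminals emitted so far, current non-terminal)
    while queue:
        prefix, nonterminal = queue.popleft()
        for x, nxt in productions[nonterminal]:
            word = prefix + x
            if len(word) > max_len:
                continue
            if nxt is None:
                results.add(word)             # derivation finished with A -> x
            else:
                queue.append((word, nxt))     # derivation continues with A -> xB
    return sorted(results)

print(generate())   # ['aaab', 'aab', 'ab', 'b']
```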
REGULAR LANGUAGE
- Regular languages are the class of languages that can be represented using finite automata, regular expressions, or regular grammar. These languages have predictable patterns and are computationally efficient to recognize.
- They are used to define sets of strings, such as sequences of characters or words, that follow specific patterns.
- They are important in computer science because they form a foundation for the theory of computation and for the design of compilers and other software tools.
PROPERTIES OF REGULAR LANGUAGES:
- Regular languages are closed under operations such as union, intersection, concatenation, Kleene closure, and complement.
- Union: If L1 and L2 are two regular languages, their union L1 ∪ L2 is also regular. For example, L1 = {aⁿ | n ≥ 0} and L2 = {bⁿ | n ≥ 0}; L3 = L1 ∪ L2 = {aⁿ | n ≥ 0} ∪ {bⁿ | n ≥ 0} is also regular.
- Intersection: If L1 and L2 are two regular languages, their intersection L1 ∩ L2 is also regular (see the product-construction sketch after this list). For example, L1 = {aᵐbⁿ | n ≥ 0 and m ≥ 0} and L2 = {aᵐbⁿ | n ≥ 0 and m ≥ 0} ∪ {bⁿaᵐ | n ≥ 0 and m ≥ 0}; L3 = L1 ∩ L2 = {aᵐbⁿ | n ≥ 0 and m ≥ 0} is also regular.
- Concatenation: If L1 and L2 are two regular languages, their concatenation L1.L2 is also regular. For example, L1 = {aⁿ | n ≥ 0} and L2 = {bⁿ | n ≥ 0}; L3 = L1.L2 = {aᵐbⁿ | m ≥ 0 and n ≥ 0} is also regular.
- Kleene Closure: If L1 is a regular language, its Kleene closure L1* is also regular. For example, L1 = (a ∪ b); L1* = (a ∪ b)*.
- Complement: If L(G) is a regular language, its complement L’(G) is also regular. The complement of a language is obtained by subtracting the strings in L(G) from the set of all possible strings over Σ. For example, L(G) = {aⁿ | n > 3}; L’(G) = {aⁿ | n ≤ 3}.
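The closure under intersection is usually shown with the product construction on DFAs: run both automata in parallel and accept only when both accept. The sketch below is illustrative; the dictionary encoding of a DFA and the two toy automata (even number of a’s, strings ending in b) are assumptions made for the example.

```python
from itertools import product

# Toy DFAs over {a, b} (illustrative): D1 accepts strings with an even number
# of a's, D2 accepts strings that end in b.
D1 = {"start": "even", "accept": {"even"},
      "delta": {("even", "a"): "odd", ("odd", "a"): "even",
                ("even", "b"): "even", ("odd", "b"): "odd"}}
D2 = {"start": "q0", "accept": {"q1"},
      "delta": {("q0", "a"): "q0", ("q1", "a"): "q0",
                ("q0", "b"): "q1", ("q1", "b"): "q1"}}

def intersect(d1, d2, alphabet=("a", "b")):
    """Product construction: states are pairs; a pair accepts iff both parts accept."""
    states1 = {s for (s, _) in d1["delta"]}
    states2 = {s for (s, _) in d2["delta"]}
    delta = {((s1, s2), c): (d1["delta"][(s1, c)], d2["delta"][(s2, c)])
             for s1, s2 in product(states1, states2) for c in alphabet}
    accept = {(s1, s2) for s1 in d1["accept"] for s2 in d2["accept"]}
    return {"start": (d1["start"], d2["start"]), "accept": accept, "delta": delta}

def accepts(dfa, word):
    state = dfa["start"]
    for c in word:
        state = dfa["delta"][(state, c)]
    return state in dfa["accept"]

D3 = intersect(D1, D2)
print(accepts(D3, "aab"))   # True: even number of a's and ends in b
print(accepts(D3, "ab"))    # False: odd number of a's
```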
COMPARISON OF REGULAR EXPRESSIONS, GRAMMAR AND LANGUAGES:
Aspect | Regular Expressions | Regular Grammar | Regular Languages |
---|---|---|---|
Definition | Pattern representation of strings | Rule-based generation of strings | Language class described by regex and grammar |
Representation | Symbols and operators | Terminals, non-terminals, production rules | Finite automata, regex, or grammar |
Use Case | Text processing, validation | Syntax generation for compilers | Language recognition |
KLEENE'S THEOREM
- A language is said to be regular if it can be represented using a Finite Automaton or if a Regular Expression can be generated for it. This leads to the general result: for every Regular Expression describing a language, an equivalent Finite Automaton can be generated, and vice versa.
- For certain expressions like (a+b), ab, and (a+b)*, it is fairly easy to construct the Finite Automaton by intuition. The problem arises when we are provided with a longer Regular Expression. This brings about the need for a systematic approach to Finite Automata generation, which Kleene put forward in Kleene’s Theorem.
CONVERSION FROM REGULAR EXPRESSION TO FA:
- As regular expressions can be constructed from Finite Automata using the State Elimination Method, the reverse method, the State Decomposition Method, can be used to construct Finite Automata from a given regular expression.
- This method will construct NFA (with or without ε-transitions, depending on the expression) for the given regular expression, which can be further converted to DFA using NFA to DFA conversion.
STATE DECOMPOSITION METHOD:
- Theorem: Every language defined by a regular expression is also defined by a Finite Automaton.
Proof: Let’s assume L = L(R) for a regular expression R. We prove that L = L(M) for some ε-NFA M with:
1) Exactly one accepting state.
2) No incoming edges at the initial state.
3) No outgoing edges at the accepting state.
- The proof is done by structural induction on R, following the steps below:
Step 1: Create a starting state, say q1, and a final state, say q2. Label the transition q1 to q2 as the given regular expression, R, as in Fig 1. But, if R is (Q)*, Kleene’s closure of another regular expression Q, then create a single initial state, which will also be the final state.
- Repeat the following rules (the state decomposition method), considering the lowest-precedence regular expression operator first, until no operator is left in the expression. The precedence of operators in regular expressions is: Union < Concatenation < Kleene Closure.
- Kleene’s Closure (*) can be eliminated by introducing self-loops on states.
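The construction can also be run bottom-up from a parse tree of the expression, which is how it is usually programmed. The sketch below is a Thompson-style construction (a close relative of the state decomposition steps above, not a literal transcription of them); the tuple-based parse-tree format and state numbering are illustrative assumptions. Each fragment keeps the invariants from the theorem: one accepting state, no edges into the start state, no edges out of the accepting state.

```python
import itertools

_counter = itertools.count()

def new_state():
    return next(_counter)

def build(node):
    """Return (start, accept, transitions) for an ε-NFA; transitions is a set of
    (state, symbol, state) triples, with symbol None marking an ε-transition."""
    kind = node[0]
    start, accept = new_state(), new_state()
    if kind == "sym":                                   # single input symbol
        return start, accept, {(start, node[1], accept)}
    if kind == "union":                                 # R = P + Q
        s1, a1, t1 = build(node[1])
        s2, a2, t2 = build(node[2])
        extra = {(start, None, s1), (start, None, s2), (a1, None, accept), (a2, None, accept)}
        return start, accept, t1 | t2 | extra
    if kind == "concat":                                # R = PQ
        s1, a1, t1 = build(node[1])
        s2, a2, t2 = build(node[2])
        return s1, a2, t1 | t2 | {(a1, None, s2)}
    if kind == "star":                                  # R = P*
        s1, a1, t1 = build(node[1])
        extra = {(start, None, s1), (a1, None, accept), (start, None, accept), (a1, None, s1)}
        return start, accept, t1 | extra
    raise ValueError("unknown node: %r" % (node,))

# (a + b)* given as a parse tree, purely for illustration.
start, accept, transitions = build(("star", ("union", ("sym", "a"), ("sym", "b"))))
print("start:", start, "accept:", accept, "transitions:", len(transitions))
```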
PUMPING LEMMA FOR REGULAR LANGUAGES
- For any regular language L, there exists an integer n such that for all x ∈ L with |x| ≥ n, there exist u, v, w ∈ Σ* with x = uvw, and
(1) for all i ≥ 0: uvⁱw ∈ L
(2) |v| ≥ 1
(3) |uv| ≤ n
- The Pumping Lemma is used as a proof of the irregularity of a language. Thus, if a language is regular, it always satisfies the pumping lemma.
- If there exists at least one string made from pumping which is not in L, then L is surely not regular.
- The opposite of this may not always be true. That is, if Pumping Lemma holds, it does not mean that the language is regular.
- For example, let us prove L01 = {0ⁿ1ⁿ | n ≥ 0} is irregular. Assume that L01 is regular; then, by the Pumping Lemma, the rules above hold. Now, let x ∈ L01 with |x| ≥ n. So, by the Pumping Lemma, there exist u, v, w such that (1)–(3) hold.
- We show that (1)–(3) cannot all hold. If (2) and (3) hold, then x = 0ⁿ1ⁿ = uvw with |uv| ≤ n and |v| ≥ 1. So u = 0ᵃ, v = 0ᵇ, w = 0ᶜ1ⁿ where a + b ≤ n, b ≥ 1, c ≥ 0, and a + b + c = n. But then (1) fails for i = 0: uv⁰w = uw = 0ᵃ0ᶜ1ⁿ = 0ᵃ⁺ᶜ1ⁿ ∉ L01, since a + c ≠ n (because b ≥ 1).
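The argument above can be checked mechanically for small n: every decomposition x = uvw with |uv| ≤ n and |v| ≥ 1 fails when pumped down to i = 0. The sketch below is only a sanity check of the proof, with illustrative function names.

```python
def in_L01(s):
    """Membership test for L01 = { 0^n 1^n | n >= 0 }."""
    half = len(s) // 2
    return len(s) % 2 == 0 and s == "0" * half + "1" * half

def pumping_down_always_escapes(n):
    """For x = 0^n 1^n, check that every split x = uvw with |uv| <= n and
    |v| >= 1 gives uw (the i = 0 pump) outside L01."""
    x = "0" * n + "1" * n
    for uv_end in range(1, n + 1):                   # |uv| <= n
        for v_start in range(uv_end):                # |v| >= 1
            u, v, w = x[:v_start], x[v_start:uv_end], x[uv_end:]
            if in_L01(u + w):                        # pumping down stayed inside L01?
                return False
    return True

print(all(pumping_down_always_escapes(n) for n in range(1, 8)))   # True
```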
CONTEXT-FREE GRAMMAR
- A Context-Free Grammar (CFG) is a formal grammar that consists of a set of production rules used to generate strings in a language.
- However, many grammars contain redundant rules, unreachable symbols, or unnecessary complexities. Simplifying a CFG helps in reducing its size while preserving the generated language, making parsing more efficient.
- Simplifying a CFG is essential for:
- Efficient Parsing: Smaller grammars require fewer computations.
- Better Understanding: Removing redundant parts makes it easier to analyze.
- Optimization: Improved computational efficiency in compilers and language processing.
CONTEXT-FREE LANGUAGES (CFLs)
- Context-Free Languages (CFLs) are an essential class of languages in the field of automata theory and formal languages.
- They are generated by context-free grammars (CFGs) and are recognized by pushdown automata (PDAs). Understanding the closure properties of CFLs helps in determining which operations preserve the context-free nature of a language.
- CFLs are closed under some operations, meaning that applying these operations to CFLs will always result in another CFL. However, they are not closed under certain operations, which means that performing these operations on CFLs might result in a language that is not context-free.
PROPERTIES:
- CFLs are closed under the following operations:
1. Union:
If L and M are two context-free languages, then their union L ∪ M is also a CFL.
Construction:
- Consider two context-free grammars, G and H, for L and M respectively.
- Assume that G and H have no common variables (this can be ensured by renaming variables if needed).
- Introduce a new start symbol S and add the rule: S → S₁ | S₂. Here, S₁ and S₂ are the start symbols of G and H, respectively.
- The resulting grammar generates L ∪ M, proving that CFLs are closed under union (a code sketch of this construction, together with concatenation, follows the concatenation example below).
Example: Let L₁ = { aⁿbⁿcᵐ | m ≥ 0, n ≥ 0 } and L₂ = { aⁿbᵐcᵐ | n ≥ 0, m ≥ 0 }.
- L₁ enforces that the number of a’s equals the number of b’s.
- L₂ enforces that the number of b’s equals the number of c’s.
- Their union states that either of these conditions must be satisfied, making the resulting language context-free.
Note: So CFLs are closed under Union.
2. Concatenation:
If L and M are CFLs, then their concatenation LM is also a CFL.
Construction:
- Let G and H be the context-free grammars for L and M, respectively.
- Assume that G and H have no common variables.
- Introduce a new start symbol S and add the production: S → S₁ S₂. Here, S₁ and S₂ are the start symbols of G and H, respectively.
- This ensures that any derivation from S will first generate a string in L and then a string in M, proving that CFLs are closed under concatenation.
Example:
L1 = { aⁿbⁿ | n ≥ 0 } and L2 = { cᵐdᵐ | m ≥ 0 }
L3 = L1.L2 = { aⁿbⁿcᵐdᵐ | m ≥ 0 and n ≥ 0 } is also context free.
L1 says the number of a’s should equal the number of b’s, and L2 says the number of c’s should equal the number of d’s. Their concatenation says that first the number of a’s should equal the number of b’s, and then the number of c’s should equal the number of d’s. So we can create a PDA that first pushes for a’s, pops for b’s, then pushes for c’s and pops for d’s. It can therefore be accepted by a pushdown automaton, hence the language is context free. Note: So CFLs are closed under Concatenation.
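A minimal sketch of the union and concatenation constructions above, representing a CFG as a start symbol plus a dictionary of productions; the encoding, helper names, and toy grammars (which generate aⁿbⁿ and cᵐdᵐ for n, m ≥ 1) are illustrative assumptions.

```python
# A grammar is (start_symbol, productions); productions maps each non-terminal
# to a list of right-hand sides, each a tuple of terminals and non-terminals.
G1 = ("S1", {"S1": [("a", "S1", "b"), ("a", "b")]})   # generates a^n b^n, n >= 1
G2 = ("S2", {"S2": [("c", "S2", "d"), ("c", "d")]})   # generates c^m d^m, m >= 1

def union(g1, g2):
    """Closure under union: new start S with S -> S1 | S2 (variables assumed disjoint)."""
    (s1, p1), (s2, p2) = g1, g2
    return ("S", {"S": [(s1,), (s2,)], **p1, **p2})

def concatenation(g1, g2):
    """Closure under concatenation: new start S with S -> S1 S2."""
    (s1, p1), (s2, p2) = g1, g2
    return ("S", {"S": [(s1, s2)], **p1, **p2})

start, prods = concatenation(G1, G2)
print(start, prods["S"])   # S [('S1', 'S2')]  -- derives strings a^n b^n c^m d^m
```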
3. Kleene Closure:
If L is a CFL, then its Kleene closure L* (zero or more repetitions of strings in L) is also a CFL.
Construction:
- Let G be the context-free grammar for L with start symbol S₁.
- Introduce a new start symbol S and add the rule: S → S₁ S | ε. This rule ensures that S can derive zero or more copies of S₁, proving closure under the Kleene star.
Example:
L1 = { aⁿbⁿ | n ≥ 0 }
L1* = { aⁿbⁿ | n ≥ 0 }* is also context free. Note: So CFLs are closed under Kleene Closure.
4. Intersection and complementation:
If L1 and L2 are two context free languages, their intersection L1 ∩ L2 need not be context free.
Example
L1 = { aⁿbⁿcᵐ | n ≥ 0 and m ≥ 0 } and L2 = { aᵐbⁿcⁿ | n ≥ 0 and m ≥ 0 }
L3 = L1 ∩ L2 = { aⁿbⁿcⁿ | n ≥ 0 } need not be context free.
L1 says the number of a’s should equal the number of b’s, and L2 says the number of b’s should equal the number of c’s. Their intersection requires both conditions to be true, but a pushdown automaton can compare only two counts. So it cannot be accepted by a pushdown automaton, hence the language is not context free.
Similarly, the complement of a context free language L1, which is Σ* – L1, need not be context free. Note: So CFLs are not closed under Intersection and Complementation.
5. Reversal:
If L is a CFL, then its reversal L^R (where each string is reversed) is also a CFL.
Construction:
- Take the context-free grammar G for L.
- Modify each production so that the right-hand side of every rule is reversed.
- The resulting grammar generates L^R, proving that CFLs are closed under reversal.
Example:
- Original Grammar: S → 0S1 | 01
- Reversed Grammar: S → 1S0 | 10
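A quick sketch of the reversal construction: reverse every right-hand side of the grammar. The dictionary encoding is an illustrative assumption; the grammar is the one from the example above.

```python
def reverse_grammar(productions):
    """Closure under reversal: reverse the right-hand side of every production."""
    return {head: [tuple(reversed(rhs)) for rhs in bodies]
            for head, bodies in productions.items()}

# Original grammar S -> 0S1 | 01, as in the example above.
G = {"S": [("0", "S", "1"), ("0", "1")]}
print(reverse_grammar(G))   # {'S': [('1', 'S', '0'), ('1', '0')]}
```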
6. Homomorphism:
A homomorphism is a function that replaces each symbol in a string with another string.
If L is a CFL and h is a homomorphism, then h(L) is also a CFL.
Example:
- Let G have the production S → 0S1 | 01.
- Define a homomorphism h(0) = ab, h(1) = ε.
- Then, h(L(G)) will be generated by a new grammar with productions: S → abS | ab.
This shows that CFLs are closed under homomorphism.
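Applying a homomorphism is just a symbol-by-symbol substitution, so it is easy to check the example directly on strings of L(G); the sketch below is illustrative, with h(0) = ab and h(1) = ε as above.

```python
def homomorphism(h, word):
    """Apply a symbol-to-string homomorphism to a word, symbol by symbol."""
    return "".join(h[symbol] for symbol in word)

h = {"0": "ab", "1": ""}                      # h(0) = ab, h(1) = ε
for w in ["01", "0011", "000111"]:            # strings of L(G) = { 0^n 1^n | n >= 1 }
    print(w, "->", homomorphism(h, w))        # 01 -> ab, 0011 -> abab, 000111 -> ababab
```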
PUMPING LEMMA FOR CONTEXT-FREE LANGUAGES (CFLs)
- Pumping Lemma for CFL states that for any Context Free Language L, it is possible to find two substrings that can be ‘pumped’ any number of times and still be in the same language.
- For any language L, we break its strings into five parts and pump second and fourth substring. Pumping Lemma, here also, is used as a tool to prove that a language is not CFL. Because, if any one string does not satisfy its conditions, then the language is not CFL.
- If L is a context-free language, there is a pumping length p such that any string w ∈ L of length ≥ p can be written as w = uvxyz, where vy ≠ ε, |vxy| ≤ p, and for all i ≥ 0, uvⁱxyⁱz ∈ L.
- Example:
Find out whether the language L = {0ⁿ1ⁿ2ⁿ | n ≥ 1} is context free or not.
- Assume L is context free. Then L must satisfy the pumping lemma.
At first, choose the number n of the pumping lemma. Then, take z as 0ⁿ1ⁿ2ⁿ.
Break z into uvwxy, where
|vwx| ≤ n and vx ≠ ε.
Hence vwx cannot involve both 0s and 2s, since the last 0 and the first 2 are at least (n+1) positions apart. There are two cases −
Case 1 − vwx has no 2s. Then vx has only 0s and 1s. Then uwy, which would have to be in L, has n 2s but fewer than n 0s or fewer than n 1s.
Case 2 − vwx has no 0s. Then, symmetrically, uwy has n 0s but fewer than n 1s or fewer than n 2s.
In both cases a contradiction occurs, because uwy should be in L by the pumping lemma but is not.
Hence, L is not a context-free language.
MEALY AND MOORE MACHINES
- Moore and Mealy Machines are transducers: finite state machines that produce an output. A Moore machine’s output depends only on the current state, while a Mealy machine’s output depends on the current state and the current input symbol.
- The Mealy and Moore machines form the backbone of state-based systems and are integral to automata theory in TOC.
MOORE MACHINES:
- Moore Machines are finite state machines with an output value, and the output depends only on the present state.
- It can be defined as (Q, q0, ∑, O, δ, λ) where:
- Q is a finite set of states.
- q0 is the initial state.
- ∑ is the input alphabet.
- O is the output alphabet.
- δ is the transition function which maps Q×∑ → Q.
- λ is the output function which maps Q → O.
- In a Moore machine, the output is shown with each state, separated by /. The length of the output of a Moore machine is greater than the length of the input by 1, because the initial state produces an output before any input is read.
- Input: 1,1
- Transition: δ (q0,1,1)=> δ(q2,1)=>q2
- Output: 000 (0 for q0, 0 for q2 and again 0 for q2)
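A minimal simulation sketch of a Moore machine. The figure this example refers to is not reproduced here, so the transition and output tables below are illustrative stand-ins, chosen so that input 11 produces output 000 as in the example.

```python
# Moore machine: outputs are attached to states, so the output string is one
# symbol longer than the input (the start state emits before any input is read).
delta = {("q0", "0"): "q1", ("q0", "1"): "q2",
         ("q1", "0"): "q1", ("q1", "1"): "q2",
         ("q2", "0"): "q1", ("q2", "1"): "q2"}
out = {"q0": "0", "q1": "1", "q2": "0"}       # λ : Q -> O

def run_moore(word, start="q0"):
    state = start
    output = out[state]                       # initial state contributes an output
    for symbol in word:
        state = delta[(state, symbol)]
        output += out[state]
    return output

print(run_moore("11"))   # 000
```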
MEALY MACHINES:
- Mealy machines are also finite state machines with an output value, and their output depends on the present state and the current input symbol.
- It can be defined as (Q, q0, ∑, O, δ, λ’) where:
- Q is a finite set of states.
- q0 is the initial state.
- ∑ is the input alphabet.
- O is the output alphabet.
- δ is the transition function which maps Q×∑ → Q.
- λ’ is the output function which maps Q×∑ → O.
- In a Mealy machine, the output is shown with each input symbol for each state, separated by /. The length of the output of a Mealy machine is equal to the length of the input.
- Input: 1,1
- Transition: δ (q0,1,1)=> δ(q2,1)=>q2
- Output: 00 (q0 to q2 transition has Output 0 and q2 to q2 transition also has Output 0)
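The corresponding Mealy sketch attaches outputs to transitions instead of states; the tables are again illustrative stand-ins, chosen so that input 11 produces output 00 as in the example.

```python
# Mealy machine: outputs are attached to transitions (λ' : Q × Σ -> O), so the
# output string has the same length as the input.
delta = {("q0", "0"): ("q1", "1"), ("q0", "1"): ("q2", "0"),
         ("q1", "0"): ("q1", "1"), ("q1", "1"): ("q2", "0"),
         ("q2", "0"): ("q1", "1"), ("q2", "1"): ("q2", "0")}

def run_mealy(word, start="q0"):
    state, output = start, ""
    for symbol in word:
        state, emitted = delta[(state, symbol)]
        output += emitted                      # one output symbol per input symbol
    return output

print(run_mealy("11"))   # 00
```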
CHOMSKY HIERARCHY
- The Chomsky hierarchy is a classification of formal grammars into four types, each of which generates a corresponding class of formal languages.
- These language classes are recognized by corresponding types of machines: finite state automata, pushdown automata, linear bounded automata, and Turing machines, respectively.
- According to the Chomsky hierarchy, grammars are divided into 4 types as follows:
- Type 0 is known as unrestricted grammar.
- Type 1 is known as context-sensitive grammar.
- Type 2 is known as a context-free grammar.
- Type 3 is known as regular grammar.
TYPE 0-UNRESTRICTED GRAMMAR:
- Type-0 grammars include all formal grammars. Type-0 grammar languages are recognized by a Turing machine. These languages are also known as the Recursively Enumerable languages.
- Grammar productions are of the form α → β, where
α is (V + T)* V (V + T)*
V : Variables
T : Terminals
β is (V + T)*. In Type 0 there must be at least one variable on the left side of a production.
TYPE 1-CONTEXT SENSITIVE GRAMMAR:
- Type-1 grammars generate context-sensitive languages. The language generated by the grammar is recognized by a Linear Bounded Automaton.
- In Type 1:
- First of all, Type 1 grammar should be Type 0.
- Grammar productions are in the form α → β with |α| ≤ |β|, that is, the count of symbols in α is less than or equal to the count of symbols in β.
- Also, β ∈ (V + T)+, i.e., β cannot be ε.
TYPE 2-CONTEXT FREE GRAMMAR:
- Type-2 grammars generate context-free languages. The language generated by the grammar is recognized by a Pushdown Automaton.
- In Type 2:
- First of all, it should be Type 1.
- The left-hand side of a production can have only one variable, and there is no restriction on β.
- For example:
S --> AB
A --> a
B --> b
TYPE 3-REGULAR GRAMMAR:
- Type-3 grammars generate regular languages. These languages are exactly all languages that can be accepted by a finite-state automaton. Type 3 is the most restricted form of grammar.
- Type 3 grammars should be in the given form only:
V --> VT / T (left-regular grammar)
(or)
V --> TV / T (right-regular grammar)