Chomsky normal form

In formal language theory, a context-free grammar G is said to be in Chomsky normal form (first described by Noam Chomsky) [1] if all of its production rules are of the form: [2]:92–93,106

A → BC,   or
A → a,   or
S → ε,

where A, B, and C are nonterminal symbols, a is a terminal symbol (a symbol that represents a constant value), S is the start symbol, and ε denotes the empty string. Also, neither B nor C may be the start symbol, and the third production rule can only appear if ε is in L(G), the language produced by the context-free grammar G.
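The three permitted rule forms can be checked mechanically. The following is a minimal sketch, not taken from the cited sources. It assumes a hypothetical representation in which a grammar is a Python dict mapping each nonterminal to a set of right-hand sides, each right-hand side is a tuple of symbols, the empty tuple stands for ε, and a symbol counts as a nonterminal exactly when it is a key of the dict. The name is_cnf is likewise an illustrative choice.

```python
def is_cnf(grammar, start):
    """Return True if every rule has one of the three permitted forms."""
    for lhs, bodies in grammar.items():
        for rhs in bodies:
            if len(rhs) == 0:
                # S -> ε is allowed only for the start symbol.
                if lhs != start:
                    return False
            elif len(rhs) == 1:
                # A -> a: the single symbol must be a terminal.
                if rhs[0] in grammar:
                    return False
            elif len(rhs) == 2:
                # A -> BC: both must be nonterminals, neither the start symbol.
                if any(s not in grammar or s == start for s in rhs):
                    return False
            else:
                return False
    return True


good = {"S": {("A", "B"), ()}, "A": {("a",)}, "B": {("b",)}}
bad = {"S": {("A", "b")}, "A": {("a",)}}  # terminal b is not solitary
print(is_cnf(good, "S"), is_cnf(bad, "S"))  # prints: True False
```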

Every grammar in Chomsky normal form is context-free, and conversely, every context-free grammar can be transformed into an equivalent one [note 1] which is in Chomsky normal form and has a size no larger than the square of the original grammar's size.

Converting a grammar to Chomsky normal form

To convert a grammar to Chomsky normal form, a sequence of simple transformations is applied in a certain order; this is described in most textbooks on automata theory. [2]:87–94 [3] [4] [5] The presentation here follows Hopcroft, Ullman (1979), but is adapted to use the transformation names from Lange, Leiß (2009). [6] [note 2] Each of the following transformations establishes one of the properties required for Chomsky normal form.

START: Eliminate the start symbol from right-hand sides

Introduce a new start symbol S0, and a new rule

S0 → S,

where S is the previous start symbol. This doesn't change the grammar's produced language, and S0 won't occur on any rule's right-hand side.
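A minimal sketch of the START step follows. It assumes a hypothetical dict representation (nonterminal mapped to a set of right-hand-side tuples, with the empty tuple standing for ε); the function name start_transform and the default name "S0" are illustrative choices, and the sketch simply refuses to run if that name is already in use.

```python
def start_transform(grammar, start, new_start="S0"):
    """Add a fresh start symbol with the single rule new_start -> start."""
    used = set(grammar) | {
        s for bodies in grammar.values() for rhs in bodies for s in rhs
    }
    if new_start in used:
        # Assumption of this sketch: the caller supplies an unused name.
        raise ValueError(f"{new_start!r} already occurs in the grammar")
    result = {lhs: set(bodies) for lhs, bodies in grammar.items()}
    result[new_start] = {(start,)}
    return result, new_start


# S -> aSb | ε has its start symbol on a right-hand side.
g = {"S": {("a", "S", "b"), ()}}
g2, s0 = start_transform(g, "S")
print(g2[s0])  # prints: {('S',)}
```

The original rules are copied unchanged, so the new start symbol cannot occur on any right-hand side.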

TERM: Eliminate rules with nonsolitary terminals

To eliminate each rule

A → X1 ... a ... Xn

in which a terminal symbol a is not the only symbol on the right-hand side, introduce, for every such terminal, a new nonterminal symbol Na, and a new rule

Na → a.

Change every rule

A → X1 ... a ... Xn

to

A → X1 ... Na ... Xn.

If several terminal symbols occur on the right-hand side, simultaneously replace each of them by its associated nonterminal symbol. This doesn't change the grammar's produced language. [2]:92
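The TERM step can be sketched as below. The dict representation (nonterminal mapped to a set of right-hand-side tuples; a symbol is a nonterminal exactly when it is a key) and the name term_transform are assumptions of this sketch. The new nonterminals are named "N_" plus the terminal, which is assumed not to collide with existing symbols.

```python
def term_transform(grammar):
    """Replace every nonsolitary terminal a by a new nonterminal N_a."""
    result = {}
    helpers = {}  # terminal -> name of its new nonterminal
    for lhs, bodies in grammar.items():
        new_bodies = set()
        for rhs in bodies:
            if len(rhs) < 2:
                # ε and solitary symbols are left untouched.
                new_bodies.add(rhs)
                continue
            new_rhs = []
            for sym in rhs:
                if sym in grammar:
                    new_rhs.append(sym)
                else:
                    # Hypothetical naming scheme; assumed to be collision-free.
                    helpers.setdefault(sym, "N_" + sym)
                    new_rhs.append(helpers[sym])
            new_bodies.add(tuple(new_rhs))
        result[lhs] = new_bodies
    for terminal, nonterminal in helpers.items():
        result[nonterminal] = {(terminal,)}
    return result


g = {"S": {("a", "S", "b"), ("a",)}}
out = term_transform(g)
print(sorted(out))  # prints: ['N_a', 'N_b', 'S']
```

All terminals of one right-hand side are replaced in a single pass, matching the simultaneous replacement described above.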

BIN: Eliminate right-hand sides with more than 2 nonterminals

Replace each rule

A → X1 X2 ... Xn

with more than 2 nonterminals X1,...,Xn by rules

A → X1 A1,
A1 → X2 A2,
... ,
An-2 → Xn-1 Xn,

where Ai are new nonterminal symbols. Again, this doesn't change the grammar's produced language. [2]:93
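A minimal sketch of the BIN step, under the same kind of assumptions: the grammar is a hypothetical dict from nonterminals to sets of right-hand-side tuples, the name bin_transform is illustrative, and fresh nonterminals are named by appending a counter to the left-hand side, which is assumed not to clash with existing names.

```python
def bin_transform(grammar):
    """Split every right-hand side longer than 2 into a chain of binary rules."""
    result = {lhs: set() for lhs in grammar}
    counter = 0
    for lhs, bodies in grammar.items():
        for rhs in sorted(bodies):  # sorted only to make fresh names deterministic
            head = lhs
            while len(rhs) > 2:
                counter += 1
                fresh = f"{lhs}_{counter}"  # hypothetical fresh-name scheme
                # head -> X1 fresh, then continue with fresh -> X2 ... Xn
                result.setdefault(head, set()).add((rhs[0], fresh))
                head, rhs = fresh, rhs[1:]
            result.setdefault(head, set()).add(rhs)
    return result


g = {
    "A": {("B", "C", "D", "E")},
    "B": {("b",)}, "C": {("c",)}, "D": {("d",)}, "E": {("e",)},
}
out = bin_transform(g)
print(out["A"], out["A_1"], out["A_2"])
# prints: {('B', 'A_1')} {('C', 'A_2')} {('D', 'E')}
```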

DEL: Eliminate ε-rules

An ε-rule is a rule of the form

A → ε,

where A is not S0, the grammar's start symbol.

To eliminate all rules of this form, first determine the set of all nonterminals that derive ε. Hopcroft and Ullman (1979) call such nonterminals nullable, and compute them as follows:

  • If a rule A → ε exists, then A is nullable.
  • If a rule A → X1 ... Xn exists, and every single Xi is nullable, then A is nullable, too.
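The two conditions above amount to a fixed-point computation, which might be sketched as follows. The dict representation (nonterminal mapped to a set of right-hand-side tuples, the empty tuple standing for ε) and the name nullable_symbols are assumptions made for illustration.

```python
def nullable_symbols(grammar):
    """Return the set of nonterminals that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, bodies in grammar.items():
            if lhs in nullable:
                continue
            # all() over an empty tuple is True, which covers the rule A -> ε.
            if any(all(s in nullable for s in rhs) for rhs in bodies):
                nullable.add(lhs)
                changed = True
    return nullable


# S0 -> AbB | C,  B -> AA | AC,  C -> b | c,  A -> a | ε
g = {
    "S0": {("A", "b", "B"), ("C",)},
    "B": {("A", "A"), ("A", "C")},
    "C": {("b",), ("c",)},
    "A": {("a",), ()},
}
print(sorted(nullable_symbols(g)))  # prints: ['A', 'B']
```

The loop repeats until a full pass adds nothing new, so it stops after at most one pass per nonterminal plus one.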

Obtain an intermediate grammar by replacing each rule

A → X1 ... Xn

by all versions with some nullable Xi omitted. By deleting in this grammar each ε-rule, unless its left-hand side is the start symbol, the transformed grammar is obtained. [2]:90

For example, in the following grammar, with start symbol S0,

S0 → AbB | C
B → AA | AC
C → b | c
A → a | ε

the nonterminal A, and hence also B, is nullable, while neither C nor S0 is. Hence the following intermediate grammar is obtained: [note 3]

S0 → AbB | Ab | bB | b   |   C
B → AA | A | A | ε   |   AC | C
C → b | c
A → a | ε

In this grammar, all ε-rules have been "inlined at the call site". [note 4] In the next step, they can hence be deleted, yielding the grammar:

S0 → AbB | Ab | bB | b   |   C
B → AA | A   |   AC | C
C → b | c
A → a

This grammar produces the same language as the original example grammar, viz. {ab,aba,abaa,abab,abac,abb,abc,b,ba,baa,bab,bac,bb,bc,c}, but has no ε-rules.
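Since neither example grammar is recursive, both languages are finite and the claimed equality can be checked by brute-force enumeration. The sketch below assumes a hypothetical dict representation (nonterminal mapped to a set of right-hand-side tuples, the empty tuple standing for ε); the function name language is illustrative, and the approach would not terminate on a recursive grammar.

```python
from itertools import product


def language(grammar, symbol):
    """Set of strings derivable from symbol; assumes a non-recursive grammar."""
    if symbol not in grammar:
        return {symbol}  # a terminal derives only itself
    words = set()
    for rhs in grammar[symbol]:
        parts = [language(grammar, s) for s in rhs]
        # product() of no parts yields one empty combination, i.e. ε.
        words |= {"".join(combo) for combo in product(*parts)}
    return words


original = {
    "S0": {("A", "b", "B"), ("C",)},
    "B": {("A", "A"), ("A", "C")},
    "C": {("b",), ("c",)},
    "A": {("a",), ()},
}
transformed = {
    "S0": {("A", "b", "B"), ("A", "b"), ("b", "B"), ("b",), ("C",)},
    "B": {("A", "A"), ("A",), ("A", "C"), ("C",)},
    "C": {("b",), ("c",)},
    "A": {("a",)},
}
print(language(original, "S0") == language(transformed, "S0"))  # prints: True
```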

UNIT: Eliminate unit rules

A unit rule is a rule of the form

A → B,

where A, B are nonterminal symbols. To remove it, for each rule

B → X1 ... Xn,

where X1 ... Xn is a string of nonterminals and terminals, add rule

A → X1 ... Xn

unless this is a unit rule which has already been (or is being) removed.
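One way to carry out the UNIT step is sketched below. Note that it uses a unit-closure formulation rather than the rule-by-rule procedure described above: for each nonterminal it first collects everything reachable through unit rules alone, then copies the non-unit right-hand sides. The dict representation (nonterminal mapped to a set of right-hand-side tuples) and the name unit_transform are assumptions of the sketch.

```python
def unit_transform(grammar):
    """Remove unit rules by copying non-unit bodies along unit-rule chains."""
    def is_unit(rhs):
        return len(rhs) == 1 and rhs[0] in grammar

    result = {}
    for lhs in grammar:
        # Nonterminals reachable from lhs using unit rules only.
        reachable, stack = {lhs}, [lhs]
        while stack:
            current = stack.pop()
            for rhs in grammar[current]:
                if is_unit(rhs) and rhs[0] not in reachable:
                    reachable.add(rhs[0])
                    stack.append(rhs[0])
        # lhs inherits every non-unit body of every reachable nonterminal.
        result[lhs] = {
            rhs for target in reachable for rhs in grammar[target]
            if not is_unit(rhs)
        }
    return result


# S0 -> AbB | Ab | bB | b | C,  B -> AA | A | AC | C,  C -> b | c,  A -> a
g = {
    "S0": {("A", "b", "B"), ("A", "b"), ("b", "B"), ("b",), ("C",)},
    "B": {("A", "A"), ("A",), ("A", "C"), ("C",)},
    "C": {("b",), ("c",)},
    "A": {("a",)},
}
out = unit_transform(g)
print(sorted(out["B"]))
# prints: [('A', 'A'), ('A', 'C'), ('a',), ('b',), ('c',)]
```

Tracking the reachable set also handles cycles of unit rules, which corresponds to never re-adding a unit rule that has already been removed.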

Order of transformations

When choosing the order in which the above transformations are to be applied, it has to be considered that some transformations may destroy the result achieved by other ones. For example, START will re-introduce a unit rule if it is applied after UNIT. The table shows which orderings are admitted.

Moreover, the worst-case bloat in grammar size [note 5] depends on the transformation order. Using |G| to denote the size of the original grammar G, the size blow-up in the worst case may range from |G|^2 to 2^(2|G|), depending on the transformation algorithm used. [6]:7 The blow-up in grammar size depends on the order between DEL and BIN. It may be exponential when DEL is done first, but is linear otherwise. UNIT can incur a quadratic blow-up in the size of the grammar. [6]:5 The orderings START,TERM,BIN,DEL,UNIT and START,BIN,DEL,UNIT,TERM lead to the least (i.e. quadratic) blow-up.
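The effect of the order between DEL and BIN can be illustrated on a single rule A → X1 ... Xn in which every Xi is nullable. The sketch below counts the non-empty right-hand sides that DEL produces, first for the long rule and then for the chain of binary rules that BIN would create from it. It counts rules rather than the size measure of the cited paper, and the helper name del_versions and the symbol names are hypothetical.

```python
from itertools import product


def del_versions(rhs, nullable):
    """All non-empty versions of rhs with some nullable symbols omitted."""
    choices = [((s,), ()) if s in nullable else ((s,),) for s in rhs]
    versions = {
        tuple(s for part in pick for s in part) for pick in product(*choices)
    }
    return versions - {()}


n = 10
symbols = tuple(f"X{i}" for i in range(1, n + 1))
nullable = set(symbols)

# DEL first: one rule with n nullable symbols.
del_first = len(del_versions(symbols, nullable))

# BIN first: A -> X1 A1, A1 -> X2 A2, ..., A8 -> X9 X10.
# The new chain nonterminals A1..A8 are nullable too, since all Xi are.
chain = [(symbols[i], f"A{i + 1}") for i in range(n - 2)] + [symbols[-2:]]
chain_nullable = nullable | {f"A{i + 1}" for i in range(n - 2)}
bin_first = sum(len(del_versions(rhs, chain_nullable)) for rhs in chain)

print(del_first, bin_first)  # prints: 1023 27
```

For this rule the count grows as 2^n - 1 when DEL comes first, but only as 3(n - 1) when BIN comes first.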
