Chomsky normal form |
In
where A, B, and C are
Every grammar in Chomsky normal form is
To convert a grammar to Chomsky normal form, a sequence of simple transformations is applied in a certain order; this is described in most textbooks on automata theory.^{ [2]}^{:87–94}^{ [3]}^{ [4]}^{ [5]} The presentation here follows Hopcroft, Ullman (1979), but is adapted to use the transformation names from Lange, Leiß (2009).^{ [6]}^{ [note 2]} Each of the following transformations establishes one of the properties required for Chomsky normal form.
Introduce a new start symbol S_{0}, and a new rule
where S is the previous start symbol. This doesn't change the grammar's produced language, and S_{0} won't occur on any rule's right-hand side.
To eliminate each rule
with a terminal symbol a being not the only symbol on the right-hand side, introduce, for every such terminal, a new nonterminal symbol N_{a}, and a new rule
Change every rule
to
If several terminal symbols occur on the right-hand side, simultaneously replace each of them by its associated nonterminal symbol. This doesn't change the grammar's produced language.^{ [2]}^{:92}
Replace each rule
with more than 2 nonterminals X_{1},...,X_{n} by rules
where A_{i} are new nonterminal symbols. Again, this doesn't change the grammar's produced language.^{ [2]}^{:93}
An ε-rule is a rule of the form
where A is not the grammar's start symbol.
To eliminate all rules of this form, first determine the set of all nonterminals that derive ε. Hopcroft and Ullman (1979) call such nonterminals nullable, and compute them as follows:
Obtain an intermediate grammar by replacing each rule
by all versions with some nullable X_{i} omitted. By deleting in this grammar each ε-rule, unless its left-hand side is the start symbol, the transformed grammar is obtained.^{ [2]}^{:90}
For example, in the following grammar, with start symbol S_{0},
the nonterminal A, and hence also B, is nullable, while neither C nor S_{0} is. Hence the following intermediate grammar is obtained:^{ [note 3]}
In this grammar, all ε-rules have been "
This grammar produces the same language as the original example grammar, viz. {ab,aba,abaa,abab,abac,abb,abc,b,bab,bac,bb,bc,c}, but apparently has no ε-rules.
A unit rule is a rule of the form
where A, B are nonterminal symbols. To remove it, for each rule
where X_{1} ... X_{n} is a string of nonterminals and terminals, add rule
unless this is a unit rule which has already been removed.
Mutual preservation of transformation results |
|||||
---|---|---|---|---|---|
Transformation X always preserves (✓) resp. may destroy (✗) the result of Y: |
|||||
_{X}\^{Y} | START | TERM | BIN | DEL | UNIT |
START | ✓ | ✓ | ✗ | ✗ | |
TERM | ✓ | ✗ | ✓ | ✓ | |
BIN | ✓ | ✓ | ✓ | ✓ | |
DEL | ✓ | ✓ | ✓ | ✗ | |
UNIT | ✓ | ✓ | ✓ | (✓)^{*} | |
^{*}UNIT preserves the result of DEL if START had been called before. |
When choosing the order in which the above transformations are to be applied, it has to be considered that some transformations may destroy the result achieved by other ones. For example, START will re-introduce a unit rule if it is applied after UNIT. The table shows which orderings are admitted.
Moreover, the worst-case bloat in grammar size^{ [note 5]} depends on the transformation order. Using |G| to denote the size of the original grammar G, the size blow-up in the worst case may range from |G|^{2} to 2^{2 |G|}, depending on the transformation algorithm used.^{ [6]}^{:7} The blow-up in grammar size depends on the order between DEL and BIN. It may be exponential when DEL is done first, but is linear otherwise. UNIT can incur a quadratic blow-up in the size of the grammar.^{ [6]}^{:5} The orderings START,TERM,BIN,DEL,UNIT and START,BIN,DEL,UNIT,TERM lead to the least (i.e. quadratic) blow-up.