Documentation for the Generated Parser¶
This chapter describes the python module emitted by Wisent. See the previous chapter for information to about how to generate this output.
The Parser class¶
The output of Wisent is a complete Python source file, implementing
a single Python class Parser. The generated file can be used
stand-alone and has no dependency on Wisent any more.
Assuming you wrote Wisent’s output to the file parser.py, you
can use the generated parser as illustrated by the following code
sniplet:
from parser import Parser
input_data = some_iterable
p = Parser()
try:
tree = p.parse(input_data)
except p.ParseErrors, e:
handle_parse_errors(e.errors)
# now `tree` contains the parse tree
Parser objects have the following attributes:
- class Parser(max_err=None, errcorr_pre=4, errcorr_post=4)¶
This class implements the parser for input data in the form described by the Wisent input grammar.
The constructor arguments are all optional, they control the handling of parse errors: max_err can be given to bound the number of errors reported during one run of the parser. errcorr_pre controls how many tokens before an invalid token the parser considers when trying to repair the input. errcorr_post controls how far beyond an invalid token the parser reads when evaluating the quality of an attempted repair.
- parse(input)¶
A method to convert a given input into a parse tree. See the description below.
- terminals¶
A Python list, containing all terminal symbols of the grammar.
- leaves(tree)¶
A generator to iterate over all leaves (corresponding to terminal symbols) of a parse tree. See the description of parse trees below.
- ParseErrors¶
A reference to the
ParseErrorsexception class. This allows you to useexcept Parser.ParseErrorsclauses for error handling.
- EOF¶
An object used internally to mark the end of input. You might encounter this in data attached to a
ParseErrorsexception.
The parser class is always called Parser. In order to use
parsers for several different grammars in one project, you should
write the parsers to different files and then import them as in the
following example:
from parser1 import Parser as Parser1
from parser2 import Parser as Parser2
Parser Input¶
The input data to be parsed is given as the argument of the
Parser.parse() method. It must be an iterable, consisting of a
sequence of Python tuples and the first element of each tuple must be
a terminal symbol of the grammar. All other elements of the input
tuples are copied into the output parse tree and are otherwise ignored
by the parser; you can use them to attach semantic values to the
symbols or to keep track of input line numbers for use in error
messages.
Example 4. For a parser generated from the grammar given in example 1, the following Python sequence is a valid input:
[ ( 'token', 'grammar' ),
( ':', ),
( 'token', 'rule' ),
( '*', ),
( ';', ) ]
Normally, the parser input is obtained by “tokenizing” a string. You
can either use a generator function, or you can store the result of
tokenization in a list and then pass this list to
Parser.parse(). See the section Adding a Tokenizer in the
tutorial for an example of the second approach.
Parse Trees¶
The Parser.parse() method returns a parse tree. The definition
of the tree is recursive, using tuples to represent sub-trees.
There are two cases:
A parse tree corresponding to a terminal symbol equals the corresponding tuple from the input data. These nodes form the leaves of the parse tree; they can be recognised by the fact that the first element of the tuple is contained in the list
Parser.terminals.All other tuples are inner nodes of the tree, they correspond to grammar rules. These tuples have a non-terminal symbol as the first element. The remaining elements are the sub-trees corresponding to the items on the right-hand side of the corresponding rule.
The complete tree is thus a collection of nested Python tuples. The
first element of the tree returned by Parser.parse() is always
the start symbol of the grammar. Following these recursive rules, the
function print_tree from the example code in section
Running Wisent of the tutorial can be used to display a parse
tree.
Example 5. The input from example 4 (we ignore the special role of
symbols starting with _ for a moment) leads to the following parse
tree:
('grammar',
('rule',
('token', 'grammar'),
(':',),
('_alternatives',
('list',
('_item',
('token', 'rule')),
('*',))),
(';',)))
Transparent Symbols¶
In real applications, parse trees often are deeply nested because they
contain many levels of “uninteresting” symbols like _alternatives
in example 5. To ease processing of the resulting trees, Wisent
generated parsers omit non-terminal symbols whose names start with an
underscore ‘_’. The children of the non-terminal are directly
inserted into the containing tree, replacing the sub-tree. Each such
substitution reduces the local nesting level by one. Since these
special symbols cannot be seen in the resulting parse tree, they are
called transparent symbols.
Following these rules, the “real” parse tree generated in example 5 will be as follows:
('grammar',
('rule',
('token', 'grammar'),
(':',),
('list',
('token', 'rule'),
('*',)),
(';',)))
Parse Errors¶
When a parse error is encountered, Wisent tries to “repair” the input in order to continue parsing so that as many parse errors as possible can be found in one run. Repairs are attempted by inserting, deleting or changing a single token in a neighbourhood of the first un-parseable token.
All parse errors are returned simultaneously by raising a
ParseErrors exception.
- exception ParseErrors¶
The exception object has the following two attributes:
- errors¶
A list of tuples, each describing one error. Each tuple consists of the first input token which could not be processed and the list of terminal symbols which were allowed at this point. The list of allowed symbols might contain the special value
Parser.EOFto indicate that an end of input was allowed at this point.
- tree¶
A “repaired” parse tree which might be used for further error checking, or None if no repair was possible.