Documentation for Wisent

This chapter describes the Wisent parser generator itself. The python code generated by Wisent is described in the following chapter.

Describing the Format of your Input

Wisent grammar files are used to describe the format of the input your program has to read. By convention these files have names ending in ”.wi”. If the grammar file contains non-ASCII characters, it must be UTF-8 encoded.

  • A grammar file consist of a sequence of rules.
  • Each rule consists of a symbol, a colon :, the expansion of the symbol given as a list of tokens, and a semicolon ;.
  • In the list of tokens on the right-hand side of the colon, the following operators can be used: item * is an abbreviation for an arbitrary number of occurrences of item, item + stands for one or more occurrences of item, and item ? stands for zero or one occurrences of item.
  • | can be used to separate alternatives, and brackets ( and ) can be used for grouping.
  • Symbols which contain non-alphanumeric characters should be put in quotation marks.
  • The character # can be used for comments in the grammar files: everything after a #, until the next end of line, is ignored.
  • Exclamation marks ! can be used in certain places to resolve conflicts in the gammar. This mechanism is explained in the next section.

Every symbol which occurs on the left hand side of a rule is a non-terminal symbol. Every symbol which occurs in the expansion of a rule and is not a non-terminal symbol is considered to be a terminal symbol. The non-terminal symbol on the left hand side of the first grammar rule is used as the ‘start symbol’.

Example 1. The grammar to describe the structure of Wisent input files looks as follows:

grammar: rule*;
rule: token ":" _alternatives ";";
_alternatives: list ( "|" list )*;
list: ( _item("?"|"*"|"+")? | "!" )* ;
_item: token | string | group;
group: "(" _alternatives ")";

Note, that there is some tricky recursion involved here: this grammar describes its own structure! This effect corresponds to the fact that Wisent’s input parser is generated by wisent itself.

Wisent uses symbol names which start with an underscore character _ to denote “uninteresting” symbols. These symbols are omitted from the generated parse trees (see the more detailed description in the section about Transparent Symbols).

Conflicts

Some grammars have the property that certain input strings can be represented in two or more ways. Because of this ambiguity, Wisent cannot generate a parser for this grammar and the grammar needs to be fixed. In technical terms, this problem mean that the input grammar contains shift-reduce conflicts or reduce-reduce conflicts. Since such problems are often difficult to debug, Wiesent generates an example input string to illustrate the problem.

Example 2 (shift-reduce conflict). Consider the following grammar file:

ABC: a X Y c;
X: b | ;
Y: b | ;

In the language described by this grammar, for the input “a b c” the symbol “b” could be either interpreted as “X” or as “Y”. The corresponding Wisent error message is as follows:

$ wisent ex2.wi
ex2.wi:2:4: shift-reduce conflict: the input
ex2.wi:2:4:     'a'.'b' ...
ex2.wi:2:4:   can be shifted using the production rule
ex2.wi:2:4:     'X': .'b';
ex2.wi:2:8:   or can be reduced to
ex2.wi:2:8:     'a' 'X'.'b' ...
ex2.wi:2:8:   using the production rule
ex2.wi:2:8:     'X': ;
ex2.wi: 1 conflict, aborting ...

a b ... is the input string Wisent chose to illustrate the problem, the dot . indicates how far the parsing process has progressed when the problem occurs. The first possibility at this point (“can be shifted”) is that that at this point we are reading the expansion of an X and are just about to read in the b (indicated by the position of the dot in the fourth line of output). In this case, X would expand to b. The second possibility is that at this point we are already done reading the X; X would expand to the empty sequence in this case.

Example 3 (reduce-reduce conflict). Consider the following grammar file:

ABC: ( X | Y ) b c;
X: a;
Y: a;

Clearly, in the input string a b c, the a could be either an X or a Y. The resulting error message is:

$ wisent ex3.wi
ex3.wi:2:5: reduce-reduce conflict: the input
ex3.wi:2:5:     'a'.'b' ...
ex3.wi:2:5:   can be reduced to
ex3.wi:2:5:     'X'.'b' ...
ex3.wi:2:5:   using the production rule
ex3.wi:2:5:     'X': 'a';
ex3.wi:3:5:   or can be reduced to
ex3.wi:3:5:     'Y'.'b' ...
ex3.wi:3:5:   using the production rule
ex3.wi:3:5:     'Y': 'a';
ex3.wi: 1 conflict, aborting ...

Conflicts should normally be resolved by rewriting the grammar to make it unambiguous. Since this is sometimes difficult or cumbersome, Wisent provides an alternative mechanism to to select which one of the conflicting actions the generated parser should choose. These conflict overrides are selected by placing additional exclamation marks ! in the grammar file:

  • placing the exclamation mark before a symbol (at the position given by the “can be shifted” part of the error message) gives precedence to the shift action. In example 2, above, we could write X: ! b | ; to have the b interpreted as an X.
  • Placing the exclamation mark at the end of the rule (normally just before the semicolon) gives precedence to the reduce action. In example 2, we could write X: b | ! ; to have the b interpreted as an Y.

In either of these two cases, the correct position to place the exclamation mark can be found by looking at the line and column information given in the corresponding line of the error message.

Running Wisent

Wisent reads a grammar from a file and emits Python source code. The program is typically called as follows:

wisent -o parser.py grammar.wi

Command Line Options:

-o NAME     store output in NAME instead of printing to stdout
-e NAME     store example source code into NAME
-h          show a help message
-V          show version information

Table Of Contents

Previous topic

Tutorial

Next topic

Documentation for the Generated Parser