Lalr(1) parsing
Regular grammar generators, like Lex, are often coupled with tools, such as Yacc and Bison, that can generate parsers for more powerful languages, namely (a subset of) context-free languages. These tools take as input a description of the language to be recognized and generate a parser for that language, written in some other language (for example, Yacc and Bison generate parsers written in C). The user must always be aware of the generated parser and that is a nuisance. Bigloo provides such a tool that overcomes this annoyance. It generates parsers for the class of Lalr(1) grammars in a more opaque way.Grammar definition
An lalr(1) grammar is defined by the form:lalr-grammar term-def non-term-def...bigloo syntax
term-def is a list of terminal elements of the grammar. Terminals can grouped together to form precedence groups by including the related symbols in a sub-lists of the term-def list. Each precedence group must start with one of the keywordsleft:
, right:
or none:
-- this
indicates the associativity of the terminal symbol. Here is a sample
term-def which declares eight terminals:
(terminal-1 terminal-2 (left: terminal-3 terminal-4) terminal-5 (right: terminal-6) (none: terminal-7) terminal-8)In this case,
terminal-3
and terminal-4
both have the same
precedence, which is greater than the precedence assigned to
terminal-6
. No precedence was assigned to symbols terminal-1
,
terminal-2
, terminal-5
or terminal-8
.
Each non-term-def is a list whose first element is the
non-terminal being defined, i.e. a symbol. The remaining elements are
the production rules associated with this non-terminal. Each rule is a
list whose first element is the rule itself (a list of symbols) and
the other elements are the semantic actions associated with that
particular rule.
For example, consider the following grammar:
E ⇒ E1 + id {E.val := E1.val + id.val} | id {E.val := id.val}With Bigloo, it would be written:
(lalr-grammar (plus id) (e ((e plus id) (+ e id)) ((id) id)))The semantic value of a symbol in a rule can be accessed by simply using the name of the symbol in the semantic action associated with the rule. Because a rule can contain multiple occurrences of the same symbol, Bigloo provides a way to access these occurrences separately. To do so, the name of each occurrence must be suffixed by
@
var where var is the name of a variable that will be
bound to the semantic value of the occurrence. For example, if the
rule is
ifstmt ⇒ if E then Stmt else Stmtthen, in Bigloo, it would look like
(if-stmt ((if e then stmt@conseq else stmt@altern) (if (eval e) (eval conseq) (eval altern))))
.keep
Precedence and associativity
The bigloo lalr(1) parser generator supports operator precedence and associativity. The method for specifying the precedence for terminal symbols is described in Grammar Definition. Precedence is assigned to each non-terminal production from the precedence of the last terminal symbol appearing in that production. Typically, when the parser generator encounters a shift/reduce conflict, it produces a warning message, then chooses to reduce. When a parser generator has precedence and associativity information, it can make a much more sophisticated decision. Let's use this simple calculator grammar as an example:(lalr-grammar ((left: op-mult op-div) (left: op-add op-sub) op-lparen op-rparen op-semicolon number) (file (()) ((file stmt))) (stmt ((expr op-semicolon) (print expr))) (expr ((number) number) ((expr@a op-add expr@b) (+ a b)) ((expr@a op-sub expr@b) (- a b)) ((expr@a op-mult expr@b) (* a b)) ((expr@a op-div expr@b) (/ a b)) ((op-lparen expr op-rparen) expr))))Let's start with this input:
1 + 2 * 3;At the point where the parser has read
1 + 2
and the lookahead symbol
is *
, the parser encounters a shift/reduce conflict. Should it first
reduce by the (expr op-add expr)
production or shift the *
in
the hopes of reducing the latter expression first?
The (expr op-add expr)
production has gotten its precedence from the
op-add
terminal symbol. This is the precedence of the reduce. The
precedence of the shift comes from the precedence assigned to the lookahead
terminal symbol, which is op-mult
. Since op-mult
has higher
precedence, the parser generator in this state chooses to shift and does not
produce a warning.
Here's an example which we can use to demonstrate associativity:
1 + 2 - 3;The parser generator encounters a similar shift/reduce conflict this time, except that when it tries to determine whether to shift or reduce, it finds that both actions have the same precedence. In this case, the parser generator looks at the associativity of the precedence group containing the
op-add
and op-sub
. Since these are declared to be
left-associative, the parser generator chooses to reduce from this state,
effectively calculating the 1 + 2
. Had these symbols been
right-associative, the parser would have chosen to shift, effectively
calculating 2 - 3
first. If these symbols had been declared
non-associative with the none:
keyword, the parser would generate an
error if it ever encountered this state.
The parsing function
Once a grammar has been defined, it can be used to parse some input using the following function:read/lalrp lg rg port [emptyp]bigloo procedure
This function takes three, possibly four, arguments. The first, lg, is the Lalr(1) grammar. The second, rg, is the lexical analyzer that feeds the grammar with tokens. The third argument, port, is the port that contains the input to be parsed. The last argument, emptyp, if provided, should be a function of one argument. It is called with each new token read from the port and should return#t
if the token denotes the
end of input. The result of the call is the value computed by the semantic
actions of the production rules.
.keep
The regular grammar
In order to work properly, the regular grammar used with an Lalr(1) grammar should follow some conventions:- If a semantic value is to be associated with the token just parsed, the regular grammar should return a pair whose
- If there is no value associated with the token, the regular grammar can return just the token name. When used in conjunction with an Lalr grammar, regular grammar should never return
car
is the
token name (a symbol) and the cdr
is the semantic value.
#f
as a token
value. This is specially true when the regular grammar detects the end of
parsing. In that case, the regular grammar must not return the
#f
value. A good way to handle end-of-file is illustrated in the
following example:
(let ((g (regular-grammar () ... (else (let ((c (the-failure))) (if (eof-object? c) c (error 'rgc "Illegal character" c)))))) (l (lalr-grammar ...))) (read/lalrp l g (current-input-port)))This way, the Lalr grammar will automatically handles the end-of-file.
Debugging Lalr Grammars
Currently the debugging facility for debugging Lalr grammars is very limited. When the parameterbigloo-debug
is set to a value
greater or equal to 100, the Lalr engine outputs all of the state
changes the parser is going through.
A simple example
Here is the code for a simple calculator implemented by an Lalr(1) grammar:(begin (read/lalrp (lalr-grammar (nl plus mult minus div const lpar rpar) (lines (()) ((lines expression nl) (display "--> ") (display expression) (newline)) ((lines nl))) (expression ((expression plus term) (+ expression term)) ((expression minus term) (- expression term)) ((term) term)) (term ((term mult factor) (* term factor)) ((term div factor) (/ term factor)) ((factor) factor)) (factor ((lpar expression rpar) expression) ((const) const))) (regular-grammar () ((+ (or #\tab #\space)) (ignore)) (#\newline 'nl) ((+ digit) (cons 'const (string->number (the-string)))) (#\+ 'plus) (#\- 'minus) (#\* 'mult) (#\/ 'div) (#\( 'lpar) (#\) 'rpar)) (current-input-port)) (reset-eof (current-input-port)))