8 Context sensitive lexer

8.1 Introduction

Before version 2 of TPG, lexers were context sensitive: the parser directs the lexer to match specific tokens, so the same input string can be split into different tokens depending on the grammar rules being applied. These lexers were very flexible but slower than context free lexers, because TPG backtracking caused tokens to be matched several times.

In TPG 2, the lexer is called before the parser and produces a list of tokens from the input string. This list is then given to the parser. Because the list is built once, it remains unchanged when TPG backtracks.

Context sensitive lexers were reintroduced in TPG 2.1.2. Lexers are context free by default, but the CSL option (see 5.3.2) makes TPG use a context sensitive lexer.
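The two strategies can be contrasted with a minimal sketch in plain Python using the re module. This is not TPG's implementation: the names tokenize and match_at and the token patterns are hypothetical, chosen only for illustration. The context free function scans the input once with a fixed token set, while the context sensitive one matches whatever pattern the caller (standing in for the parser) requests at a given position, so the same text can yield different tokens.

```python
import re

# Hypothetical token set for the context free case; order decides precedence.
TOKEN_SPEC = [("number", r"\d+"), ("word", r"\w+"), ("space", r"\s+")]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    # Context free: scan the whole input once, before any parsing happens.
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        if m.lastgroup != "space":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def match_at(pattern, text, pos):
    # Context sensitive: the caller decides which pattern to try at pos.
    m = re.compile(pattern).match(text, pos)
    return None if m is None else (m.group(), m.end())

text = "2024 abc"
fixed = tokenize(text)                    # always the same token list
as_number = match_at(r"\d+", text, 0)     # the parser expects a number here
as_digit = match_at(r"\d", text, 0)       # the parser expects a single digit
as_word = match_at(r"\w+", text, 0)       # the parser expects a word
```

In the second style, a parser that backtracks to position 0 calls match_at again and repeats the matching work, which is the slowdown described above.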

8.2 Grammar structure

CSL grammars have the same structure as non CSL grammars (see 5.1), except for the CSL option (see 5.3.2).

8.3 CSL lexers

8.3.1 Regular expression syntax

The CSL lexer is based on the re module. The difference from non CSL lexers is that the given regular expression is compiled as is, without any encapsulation. Grouping is therefore possible and usable.
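To see why encapsulation matters for grouping, here is a small sketch with the re module. It assumes, purely for illustration, that a lexer encapsulates a token pattern by wrapping it in an outer named group (a common technique, not necessarily what TPG does internally); the group name item is hypothetical.

```python
import re

pattern = r"(\w+)=(.*)"

# Compiled as is: the user's groups are numbered 1 and 2.
plain = re.compile(pattern)
plain_groups = plain.match("colour=blue").groups()

# Wrapped in an outer named group (assumed encapsulation): the wrapper
# becomes group 1 and the user's groups shift to 2 and 3.
wrapped = re.compile(f"(?P<item>{pattern})")
wrapped_groups = wrapped.match("colour=blue").groups()
```

With the pattern compiled as is, the groups written by the user are exactly the groups reported by the match object.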

8.3.2 Token definition

In CSL lexers there are no predefined tokens. Tokens are always inlined, and there is no precedence issue since tokens are matched while parsing, when they are encountered in a grammar rule.

A token definition can be simulated by defining a rule to match a particular token (see figure 8.1).

Figure 8.1: Token definition in CSL parsers example

    number/int<n> -> '\d+'/n ;
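As a rough analogy in Python, a rule like the one in figure 8.1 behaves like a function that matches the pattern at the current position and returns a converted value. The function name number and its (text, pos) signature are assumptions made for this sketch, not generated TPG code.

```python
import re

NUMBER = re.compile(r"\d+")

def number(text, pos):
    # Match the digits at pos, then return int(n) and the position after the token.
    m = NUMBER.match(text, pos)
    if m is None:
        raise SyntaxError(f"number expected at position {pos}")
    n = m.group()
    return int(n), m.end()

value, end = number("42 apples", 0)
```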

In non CSL parsers there are two kinds of tokens: true tokens and token separators. To declare separators in CSL parsers, you must use the special separator rule, which is implicitly applied before each token is matched. It is therefore necessary to distinguish lexical rules from grammar rules: lexical rule declarations start with the lex keyword. In such rules the separator rule is not called, which avoids infinite recursion (separator calling separator calling separator ...). Figure 8.2 shows a separator declaration with nested C++ like comments.

Figure 8.2: Separator definition in CSL parsers example

    lex separator -> spaces | comment ;

    lex spaces -> '\s+' ;

    lex comment -> '/\*' in_comment* '\*/' ;        # C++ nested comments
    lex in_comment -> comment | '\*[^/]|[^\*]' ;
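The rules of figure 8.2 can be mirrored by a hand-written Python sketch that skips separators before a token. The function names comment and skip_separators are hypothetical, and the code is a simplified illustration under the assumption that each rule tries its alternatives in order; it is not what TPG generates.

```python
import re

SPACES = re.compile(r"\s+")
OPEN = re.compile(r"/\*")
CLOSE = re.compile(r"\*/")
IN_COMMENT_CHAR = re.compile(r"\*[^/]|[^\*]")

def comment(text, pos):
    # comment -> open in_comment* close; returns the end position or None.
    m = OPEN.match(text, pos)
    if m is None:
        return None
    pos = m.end()
    while True:
        nested = comment(text, pos)            # in_comment -> comment
        if nested is not None:
            pos = nested
            continue
        m = IN_COMMENT_CHAR.match(text, pos)   # in_comment -> '\*[^/]|[^\*]'
        if m is None:
            break
        pos = m.end()
    m = CLOSE.match(text, pos)
    return None if m is None else m.end()

def skip_separators(text, pos):
    # separator -> spaces | comment, applied repeatedly before a token.
    while True:
        m = SPACES.match(text, pos)
        if m is not None:
            pos = m.end()
            continue
        end = comment(text, pos)
        if end is not None:
            pos = end
            continue
        return pos

source = "  /* outer /* inner */ still outer */  x = 1"
start = skip_separators(source, 0)
rest = source[start:]
```

Note that this sketch never backtracks inside the loop, so an input such as `/* a **/` is rejected: the `\*[^/]` alternative consumes both stars and the closing delimiter is never seen.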

8.3.3 Token matching

In CSL parsers, tokens are matched as in non CSL parsers (see 6.3), with one additional feature: the user can exploit the grouping capabilities of the regular expressions. The text of a token can be saved with the infix / operator, and the groups of a token can be saved with the infix // operator. This operator (available only in CSL parsers) returns all the groups in a tuple. For example, figure 8.3 shows how to read entire tokens and how to split tokens into groups.

Figure 8.3: Token usage in CSL parsers examples

    lex identifier/i -> '\w+'/i ;           # a single identifier

    lex string/s -> "'([^\']*)'"//<s> ;     # a string without the quotes

    lex item/<key,val> -> "(\w+)=(.*)"//<key,val> ; # a tuple (key, value)
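In terms of the re module, the two operators correspond roughly to the whole match and to the tuple of groups of a match object. The helper names text_of and groups_of below are hypothetical; this is a sketch of the idea using the patterns of figure 8.3, not TPG code.

```python
import re

def text_of(pattern, s):
    # Analogue of the infix / operator: the whole text of the token.
    m = re.match(pattern, s)
    if m is None:
        raise SyntaxError("token expected")
    return m.group(0)

def groups_of(pattern, s):
    # Analogue of the infix // operator: all the groups, as a tuple.
    m = re.match(pattern, s)
    if m is None:
        raise SyntaxError("token expected")
    return m.groups()

identifier = text_of(r"\w+", "answer = 42")
(string,) = groups_of(r"'([^\']*)'", "'hello' rest")   # one group, one-element tuple
key, val = groups_of(r"(\w+)=(.*)", "name=TPG 2.1.2")
```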

8.4 CSL parsers

There is no difference between CSL and non CSL parsers, except for lexical rules, which look like grammar rules¹.
