Sunday, 15 November 2015

3. Python Lex

Tokenizing with PLY's 'lex' module


The first step in implementing an interpreter is to write the tokenizer using Python Lex. Tokens are recognized using regular expressions. The steps are straightforward:

1. The names of all of the valid token types are declared as a list of strings named tokens.

2. Tokens that require no special processing are declared using module-level variables named t_TOKEN_TYPE where TOKEN_TYPE matches some string in the "tokens" list. Each such variable contains a regular expression string that matches the respective token (Python raw strings are usually used since they are the most convenient way to write regular expression strings).

3. Tokens that do require special processing are declared using module-level functions that follow the same naming convention. Each such function has a docstring that is a regular expression matching the respective token. The order of these functions matters, because the regular expressions are tried in the order the functions are defined; functions with longer, more specific patterns (floats, for instance) should therefore come before functions with shorter ones (ints, for example), so that 3.14 is not tokenized as the int 3. Each function receives a value of type LexToken, modifies it, and usually returns it. Unlike Lex in C, which uses unions, the token's "value" attribute can be set to a value of any type (even an instance), which is very convenient. If the function returns None (or has no return statement), the lexer discards the current match and moves on to the next token. This can be used to handle whitespace, but a module-level variable named t_ignore is more efficient for that task.

4. Having defined the tokenizer rules, lex.lex() is called to build the tokenizer.

5. A main block (for instance, if __name__ == '__main__':) is included that exercises the tokenizer with regression tests. This is good programming practice in Python generally, but is almost essential when writing a language. A minimal sketch of this layout is shown below.
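To make the steps concrete, here is a minimal sketch of that layout. The FLOAT, INT and PLUS tokens are only illustrative placeholders of my own, not part of the language built in this series:

import ply.lex as lex

# 1. The names of the valid token types.
tokens = ("FLOAT", "INT", "PLUS")

# 2. A simple token: a module-level variable holding a raw regex string.
t_PLUS = r"\+"

# Whitespace is skipped efficiently with t_ignore.
t_ignore = " \t"

# 3. Tokens that need processing: the docstring is the regex, and the
#    function receives and returns a LexToken. t_FLOAT is defined before
#    t_INT so that "3.14" is not tokenized as the int 3.
def t_FLOAT(t):
    r"\d+\.\d+"
    t.value = float(t.value)    # the value can be set to any type
    return t

def t_INT(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))

# 4. Build the tokenizer from the rules above.
lexer = lex.lex()

# 5. A main block with a small regression test.
if __name__ == '__main__':
    lexer.input("1.5 + 2")
    assert [tok.type for tok in lexer] == ["FLOAT", "PLUS", "INT"]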

In Contrast:

lex.py is used to tokenize an input string. For example, suppose you're writing a programming language and a user supplied the following input string:
x = 3 + 42 * (s - t)
A tokenizer splits the string into individual tokens:
'x', '=', '3', '+', '42', '*', '(', 's', '-', 't', ')'
Tokens are usually given names to indicate what they are. For example:
'ID','EQUALS','NUMBER','PLUS','NUMBER','TIMES',
'LPAREN','ID','MINUS','ID','RPAREN'
More specifically, the input is broken into pairs of token types and values. For example:
('ID','x'), ('EQUALS','='), ('NUMBER','3'),
('PLUS','+'), ('NUMBER','42'), ('TIMES','*'),
('LPAREN','('), ('ID','s'), ('MINUS','-'),
('ID','t'), ('RPAREN',')')
The identification of tokens is typically done by writing a series of regular expression rules; the rest of this post shows how this is done using lex.py.
PLY uses module-level variables starting with "t_" to hold the token patterns. If the variable is a string, it is interpreted as a regular expression and the matched text becomes the token's value. If the variable is a function, its docstring contains the pattern and the function is called with the matched token; the function is free to modify that token, or to return a new token to be used in its place, and if it returns nothing the match is ignored. Usually the function only changes the "value" attribute, which is initially the matched text. In the Simple Example below, t_COUNT converts that value to an int.
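Before that example, here is a rough sketch of my own (not code from this post) that tokenizes the assignment statement from earlier. Every rule is a plain string pattern, so each token keeps the matched text as its value:

import ply.lex as lex

tokens = ("ID", "EQUALS", "NUMBER", "PLUS", "TIMES",
          "MINUS", "LPAREN", "RPAREN")

# String rules: the matched text itself becomes the token value.
t_ID     = r"[A-Za-z_][A-Za-z0-9_]*"
t_EQUALS = r"="
t_NUMBER = r"\d+"
t_PLUS   = r"\+"
t_TIMES  = r"\*"
t_MINUS  = r"-"
t_LPAREN = r"\("
t_RPAREN = r"\)"

t_ignore = " \t"

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))

lexer = lex.lex()
lexer.input("x = 3 + 42 * (s - t)")
for tok in lexer:
    print("(%r, %r)" % (tok.type, tok.value))

Running this prints the (type, value) pairs listed above, one per line.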


Simple Example:

import ply.lex as lex   # PLY's lex module lives in the ply package

# The names of the valid token types.
tokens = (
    "SYMBOL",
    "COUNT"
)

# A SYMBOL needs no special processing, so a plain regex string is used;
# it matches the one- and two-letter chemical element symbols.
t_SYMBOL = (
    r"C[laroudsemf]?|Os?|N[eaibdpos]?|S[icernbmg]?|P[drmtboau]?|"
    r"H[eofgas]?|A[lrsgutcm]|B[eraik]?|Dy|E[urs]|F[erm]?|G[aed]|"
    r"I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenaf]|T[icebmalh]|"
    r"U|V|W|Xe|Yb?|Z[nr]")

# A COUNT does need processing: the docstring is the pattern, and the
# matched text is converted from a string to an int.
def t_COUNT(t):
    r"\d+"
    t.value = int(t.value)
    return t

# Called when no token rule matches the remaining input.
def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))

lex.lex()   # build the tokenizer from the rules above

# Tokenize a sample formula and print each token's type and value.
lex.input("CH3COOH")
for tok in iter(lex.token, None):
    print("%r %r" % (tok.type, tok.value))
 


Output:

When I run the code I get the following:
'SYMBOL' 'C'
'SYMBOL' 'H'
'COUNT' 3
'SYMBOL' 'C'
'SYMBOL' 'O'
'SYMBOL' 'O'
'SYMBOL' 'H'
You can see that the count was properly converted to an integer.
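As an aside (my own variation, not part of the original listing), lex.lex() also returns a Lexer object, and in recent PLY versions that object can be driven directly instead of using the module-level lex.input() and lex.token() calls; with the same rules as above it yields the same token stream:

lexer = lex.lex()
lexer.input("CH3COOH")
for tok in lexer:    # a PLY Lexer object is directly iterable
    print("%r %r" % (tok.type, tok.value))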
