Tuesday, 17 November 2015

3.2 Specification of Tokens


Each token is specified by writing a regular expression rule. Each of these rules is defined by making a declaration with a special prefix t_ to indicate that it defines a token. For simple tokens, the regular expression can be specified as a string like this (note: Python raw strings are used since they are the most convenient way to write regular expression strings):
t_PLUS = r'\+'
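A few more simple rules in the same style, together with the tokens tuple they refer to, might look like this (a minimal sketch; the MINUS and TIMES names are just illustrative):

tokens = ('PLUS', 'MINUS', 'TIMES')   # token names; each t_ rule must match one of these

t_PLUS  = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'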
In this case, the name following the t_ must exactly match one of the names supplied in tokens. If some kind of action needs to be performed, a token rule can be specified as a function. For example, this rule matches numbers and converts the string into a Python integer.
def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print "Number %s is too large!" % t.value
        t.value = 0
    return t
When a function is used, the regular expression rule is specified in the function's documentation string. The function always takes a single argument, which is an instance of LexToken. This object has the attributes t.type (the token type, as a string), t.value (the lexeme, i.e. the actual text matched), t.lineno (the current line number), and t.lexpos (the position of the token relative to the beginning of the input text). By default, t.type is set to the name following the t_ prefix. The action function can modify the contents of the LexToken object as appropriate. However, when it is done, the resulting token should be returned. If no value is returned by the action function, the token is simply discarded and the next token is read.
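As a quick illustration (a sketch, assuming the t_PLUS and t_NUMBER rules above with both PLUS and NUMBER listed in tokens), those attributes can be inspected directly on a token returned by the lexer:

import lex            # newer PLY installs use: import ply.lex as lex

lex.lex()             # build the lexer from the t_ rules defined in this module
lex.input("3+4")
tok = lex.token()     # first LexToken from the input
print tok.type, tok.value, tok.lineno, tok.lexpos   # -> NUMBER 3 1 0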
Internally, lex.py uses the re module to do its pattern matching. When building the master regular expression, rules are added in the following order:

  1. All tokens defined by functions are added in the same order as they appear in the lexer file.
  2. Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions are added first).
Without this ordering, it can be difficult to correctly match certain types of tokens. For example, if you want separate tokens for "=" and "==", you need to make sure that "==" is checked first. By sorting regular expressions in order of decreasing length, this problem is solved for rules defined as strings. For functions, the order can be controlled explicitly since rules appearing first are checked first.
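For example, defining both operators as string rules relies on that length-based sort (a minimal sketch; the EQ and ASSIGN names are just illustrative and would also need to appear in tokens):

t_EQ     = r'=='    # two characters: sorted ahead of the shorter rule below
t_ASSIGN = r'='     # tried only after '==' fails to match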
To handle reserved words, it is usually easier to just match an identifier and do a special name lookup in a function like this:
reserved = {
   'if' : 'IF',
   'then' : 'THEN',
   'else' : 'ELSE',
   'while' : 'WHILE',
   ...
}

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value,'ID')    # Check for reserved words
    return t
This approach greatly reduces the number of regular expression rules and is likely to make things a little faster.
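One detail worth noting with this approach (a sketch in the same style; ID, NUMBER, and PLUS are just example names): the reserved token names still have to be listed in tokens, which is easy to do by adding the dictionary's values:

tokens = ['ID', 'NUMBER', 'PLUS'] + list(reserved.values())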
Note: You should avoid writing individual rules for reserved words. For example, if you write rules like this,
t_FOR   = r'for'
t_PRINT = r'print'
those rules will be triggered for identifiers that include those words as a prefix such as "forget" or "printed". This is probably not what you want.


tokenizing with PLY's 'lex' module


PLY uses variables starting with "t_" to indicate the token patterns. If the variable is a string then it is interpreted as a regular expression and the matched text becomes the value for the token. If the variable is a function then its docstring contains the pattern and the function is called with the matched token. The function is free to modify the token or return a new token to be used in its place. If nothing is returned then the match is ignored. Usually the function only changes the "value" attribute, which is initially the matched text. In the following code, t_COUNT converts the value to an int.

import lex

tokens = (
    "SYMBOL",
    "COUNT"
)

t_SYMBOL = (
    r"C[laroudsemf]?|Os?|N[eaibdpos]?|S[icernbmg]?|P[drmtboau]?|"
    r"H[eofgas]?|A[lrsgutcm]|B[eraik]?|Dy|E[urs]|F[erm]?|G[aed]|"
    r"I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenaf]|T[icebmalh]|"
    r"U|V|W|Xe|Yb?|Z[nr]")

def t_COUNT(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))

lex.lex()

lex.input("CH3COOH")
for tok in iter(lex.token, None):
    print repr(tok.type), repr(tok.value)


BTW, iter(f, sentinel) is a handy function. It calls f() repeatedly and yields each return value through the iterator protocol. When the returned value equals the sentinel it raises StopIteration, which ends the loop. In other words, I converted the sequence of function calls into an iterator, letting me use a for loop.
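For comparison, the for loop above is roughly equivalent to this explicit version (a sketch, not how I actually wrote it):

while True:
    tok = lex.token()
    if tok is None:        # the sentinel: no more input
        break
    print repr(tok.type), repr(tok.value)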
When I run the code I get the following

'SYMBOL' 'C'
'SYMBOL' 'H'
'COUNT' 3
'SYMBOL' 'C'
'SYMBOL' 'O'
'SYMBOL' 'O'
'SYMBOL' 'H'

You can see that the count was properly converted to an integer.
