Tuesday, 17 November 2015

3.3 Python Lex and Token Values

Token values


When tokens are returned by lex, they have a value that is stored in the value attribute. Normally, the value is the text that was matched. However, the value can be assigned to any Python object. For instance, when lexing identifiers, you may want to return both the identifier name and information from some sort of symbol table. To do this, you might write a rule like this:
def t_ID(t):
    ...
    # Look up symbol table information and return a tuple
    t.value = (t.value, symbol_lookup(t.value))
    ...
    return t
It is important to note that storing data in other attribute names is not recommended. The yacc.py module only exposes the contents of the value attribute. Thus, accessing other attributes may be unnecessarily awkward.
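As a more complete illustration, here is a minimal, self-contained sketch of the same idea (the symbol table symtab, the symbol_lookup helper, and the sample input are hypothetical and only for demonstration):

import ply.lex as lex

tokens = ('ID',)

# Hypothetical symbol table used only for this illustration
symtab = {'x': 'int', 'y': 'float'}

def symbol_lookup(name):
    return symtab.get(name)

def t_ID(t):
    r'[A-Za-z_][A-Za-z0-9_]*'
    # Attach both the identifier name and its symbol table entry
    t.value = (t.value, symbol_lookup(t.value))
    return t

t_ignore = ' \t'

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("x y z")
for tok in lexer:
    print(tok.type, tok.value)
# Prints: ID ('x', 'int'), ID ('y', 'float'), ID ('z', None)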


Literal characters


Literal characters can be specified by defining a variable literals in your lexing module. For example:
literals = [ '+','-','*','/' ]
or alternatively
literals = "+-*/"
A literal character is simply a single character that is returned "as is" when encountered by the lexer. Literals are checked after all of the defined regular expression rules. Thus, if a rule starts with one of the literal characters, it will always take precedence.
When a literal token is returned, both its type and value attributes are set to the character itself. For example, matching '+' as a literal produces a token whose type and value are both '+'.
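As a quick, self-contained sketch (the NUMBER token and the sample input are illustrative, not part of the original text), the following lexer treats '+', '-', '*' and '/' as literals:

import ply.lex as lex

tokens = ('NUMBER',)
literals = "+-*/"

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

t_ignore = ' \t'

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("3 + 4")
for tok in lexer:
    print(tok.type, tok.value)
# Prints NUMBER 3, then a token whose type and value are both '+', then NUMBER 4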


Discarded tokens


To discard a token, such as a comment, simply define a token rule that returns no value. For example:
def t_COMMENT(t):
    r'\#.*'
    pass
    # No return value. Token discarded
Alternatively, you can include the prefix "ignore_" in the token declaration to force a token to be ignored. For example:
t_ignore_COMMENT = r'\#.*'
Be advised that if you are ignoring many different kinds of text, you may still want to use functions since these provide more precise control over the order in which regular expressions are matched (i.e., functions are matched in order of specification whereas strings are sorted by regular expression length).
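To illustrate the ordering point, here is a small sketch (the token names and sample input are hypothetical) in which two discard rules are written as functions, so the block comment rule is guaranteed to be tried before the line comment rule:

import ply.lex as lex

tokens = ('BLOCK_COMMENT', 'LINE_COMMENT', 'DIVIDE')

# Function rules are matched in the order they are defined
def t_BLOCK_COMMENT(t):
    r'/\*(.|\n)*?\*/'
    pass    # discarded: no return value

def t_LINE_COMMENT(t):
    r'//.*'
    pass    # discarded: no return value

t_DIVIDE = r'/'
t_ignore = ' \t\n'

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("/* skipped */ / // also skipped")
for tok in lexer:
    print(tok.type, tok.value)
# Prints only: DIVIDE /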

Ignored characters


The special t_ignore rule is reserved by lex.py for characters that should be completely ignored in the input stream. Usually this is used to skip over whitespace and other non-essential characters. Although it is possible to define a regular expression rule for whitespace in a manner similar to t_newline(), the use of t_ignore provides substantially better lexing performance because it is handled as a special case and is checked in a much more efficient manner than the normal regular expression rules.
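A typical definition, as used in the PLY documentation, skips spaces and tabs:

# t_ignore is a plain string of characters to skip, not a regular expression
t_ignore = ' \t'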

Error handling


Finally, the t_error() function is used to handle lexing errors that occur when illegal characters are detected. In this case, the t.value attribute contains the rest of the input string that has not been tokenized. For example, the error function might be defined as follows:
# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)
In this case, we simply print the offending character and skip ahead one character by calling t.lexer.skip(1).

Examples:
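The following is a minimal, self-contained lexer that combines the features described above; the token names and the sample input are illustrative only:

import ply.lex as lex

tokens = ('NUMBER', 'ID')
literals = "+-*/"

# Ignored characters: spaces and tabs
t_ignore = ' \t'

# Discarded tokens: '#' comments
t_ignore_COMMENT = r'\#.*'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_ID(t):
    r'[A-Za-z_][A-Za-z0-9_]*'
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("width * 2 + 3   # compute something\n")
for tok in lexer:
    print(tok.type, tok.value)
# Prints: ID width, * *, NUMBER 2, + +, NUMBER 3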



