Token values
When tokens are returned by lex, they have a value that is stored in the value attribute. Normally, the value is the text that was matched. However, the value can be assigned to any Python object. For instance, when lexing identifiers, you may want to return both the identifier name and information from some sort of symbol table. To do this, you might write a rule like this:
    def t_ID(t):
        ...
        # Look up symbol table information and return a tuple
        t.value = (t.value, symbol_lookup(t.value))
        ...
        return t

It is important to note that storing data in other attribute names is not recommended. The yacc.py module only exposes the contents of the value attribute. Thus, accessing other attributes may be unnecessarily awkward.
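For a fuller picture, here is a minimal, self-contained sketch of this pattern. The symbol_table dictionary and the symbol_lookup helper are hypothetical stand-ins for whatever bookkeeping your application actually performs:

    import ply.lex as lex

    tokens = ('ID',)

    t_ignore = ' \t'

    # Hypothetical symbol table; a real application would populate
    # this from declarations, a prior pass, etc.
    symbol_table = {'x': 'variable', 'print': 'builtin'}

    def symbol_lookup(name):
        # Hypothetical helper returning whatever metadata the parser needs.
        return symbol_table.get(name, 'unknown')

    def t_ID(t):
        r'[A-Za-z_][A-Za-z0-9_]*'
        # Replace the matched text with a (name, info) tuple.
        t.value = (t.value, symbol_lookup(t.value))
        return t

    def t_error(t):
        t.lexer.skip(1)

    lexer = lex.lex()
    lexer.input('x print y')
    for tok in lexer:
        print(tok.type, tok.value)
    # ID ('x', 'variable')
    # ID ('print', 'builtin')
    # ID ('y', 'unknown')

Since the parser sees only t.value, packing everything it needs into that one attribute keeps the grammar actions simple.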
Literal characters
Literal characters can be specified by defining a variable literals in your lexing module. For example:

    literals = [ '+', '-', '*', '/' ]

or alternatively

    literals = "+-*/"

A literal character is simply a single character that is returned "as is" when encountered by the lexer. Literals are checked after all of the defined regular expression rules. Thus, if a rule starts with one of the literal characters, it will always take precedence.
When a literal token is returned, both its type and value attributes are set to the character itself; for example, '+'.
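To see this concretely, here is a small runnable sketch (token name and input chosen for illustration) in which each literal comes back with identical type and value:

    import ply.lex as lex

    tokens = ('NUMBER',)
    literals = "+-*/"

    t_ignore = ' \t'

    def t_NUMBER(t):
        r'\d+'
        t.value = int(t.value)
        return t

    def t_error(t):
        t.lexer.skip(1)

    lexer = lex.lex()
    lexer.input('3 + 4')
    for tok in lexer:
        print(repr(tok.type), repr(tok.value))
    # 'NUMBER' 3
    # '+' '+'
    # 'NUMBER' 4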
Discarded tokens
To discard a token, such as a comment, simply define a token rule that returns no value. For example:

    def t_COMMENT(t):
        r'\#.*'
        pass
        # No return value. Token discarded

Alternatively, you can include the prefix "ignore_" in the token declaration to force a token to be ignored. For example:

    t_ignore_COMMENT = r'\#.*'

Be advised that if you are ignoring many different kinds of text, you may still want to use functions, since these provide more precise control over the order in which regular expressions are matched (functions are matched in order of specification, whereas strings are sorted by regular expression length).
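As a concrete sketch of the two styles side by side (token names, patterns, and input chosen for illustration):

    import ply.lex as lex

    tokens = ('NUMBER',)

    # Function rule: discarded by returning nothing. Function rules
    # are matched in the order they are defined.
    def t_COMMENT(t):
        r'\#.*'
        pass  # No return value. Token discarded

    # String rule with the "ignore_" prefix: also discarded, but string
    # rules are ordered by decreasing pattern length, not by position.
    t_ignore_WS = r'[ \t]+'

    def t_NUMBER(t):
        r'\d+'
        t.value = int(t.value)
        return t

    def t_error(t):
        t.lexer.skip(1)

    lexer = lex.lex()
    lexer.input('10 # everything after the hash is dropped')
    for tok in lexer:
        print(tok.type, tok.value)
    # NUMBER 10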
Ignored characters
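The special t_ignore rule is reserved by lex.py for characters that should be skipped completely in the input stream, most commonly whitespace. Its value is a plain string of individual characters, not a regular expression:

    t_ignore = ' \t'

Characters given in t_ignore are never matched against any token rule, so they are handled substantially faster than text discarded through rules such as t_ignore_COMMENT above.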
Error handling
Finally, the t_error() function is used to handle lexing errors that occur when illegal characters are detected. In this case, the t.value attribute contains the rest of the input string that has not been tokenized. For example, the error function might be defined as follows:

    # Error handling rule
    def t_error(t):
        print("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)

Here, we simply print the offending character and skip ahead one character by calling t.lexer.skip(1).
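Putting the pieces together, here is a minimal runnable sketch (token name and input chosen for illustration) showing the error rule firing on a character no rule matches:

    import ply.lex as lex

    tokens = ('NUMBER',)

    t_ignore = ' \t'

    def t_NUMBER(t):
        r'\d+'
        t.value = int(t.value)
        return t

    # Error handling rule: t.value holds all remaining untokenized input.
    def t_error(t):
        print("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)

    lexer = lex.lex()
    lexer.input('12 $ 34')
    for tok in lexer:
        print(tok.type, tok.value)
    # NUMBER 12
    # Illegal character '$'
    # NUMBER 34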