t_PLUS = r'\+'

In this case, the name following the t_ must exactly match one of the names supplied in tokens. If some kind of action needs to be performed, a token rule can be specified as a function. For example, this rule matches numbers and converts the string into a Python integer.

def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print "Number %s is too large!" % t.value
        t.value = 0
    return t

When a function is used, the regular expression rule is specified in the function's documentation string. The function always takes a single argument which is an instance of LexToken. This object has the attributes t.type, the token type (as a string); t.value, the lexeme (the actual text matched); t.lineno, the current line number; and t.lexpos, the position of the token relative to the beginning of the input text. By default, t.type is set to the name following the t_ prefix. The action function can modify the contents of the LexToken object as appropriate. However, when it is done, the resulting token should be returned. If no value is returned by the action function, the token is simply discarded and the next token is read.
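As a minimal sketch of that discard behavior, PLY's usual idiom for tracking line numbers is a rule whose action updates the lexer state and returns nothing, so the newlines never show up as tokens (t_newline and t.lexer.lineno are standard PLY conventions, not part of the example above):

def t_newline(t):
    r'\n+'
    # Update the line count so later tokens report the right t.lineno.
    t.lexer.lineno += len(t.value)
    # No return value, so the newline token is discarded.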
Internally, lex.py uses the re module to do its pattern matching. When building the master regular expression, rules are added in the following order (a short example follows the list):
- All tokens defined by functions are added in the same order as they appear in the lexer file.
- Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions are added first).
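To make the ordering concrete, here is a minimal sketch (the ASSIGN, EQUALS, and NAME token names are invented for this example). Even though t_ASSIGN is written first, lex sorts the string rules by decreasing pattern length, so == is tried before = and never tokenizes as two separate = signs:

import lex

tokens = ("ASSIGN", "EQUALS", "NAME")

t_ASSIGN = r'='
t_EQUALS = r'=='       # longer pattern, so it goes into the master regex first
t_NAME   = r'[a-zA-Z_][a-zA-Z_0-9]*'
t_ignore = ' '         # skip spaces

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))

lex.lex()
lex.input("a == b")
for tok in iter(lex.token, None):
    print repr(tok.type), repr(tok.value)
# prints 'NAME' 'a', then 'EQUALS' '==', then 'NAME' 'b'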
To handle reserved words, it is usually easier to just match an identifier and do a special name lookup in a function like this:
reserved = {
    'if'    : 'IF',
    'then'  : 'THEN',
    'else'  : 'ELSE',
    'while' : 'WHILE',
    ...
}

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value, 'ID')    # Check for reserved words
    return t

This approach greatly reduces the number of regular expression rules and is likely to make things a little faster.
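One detail the snippet above leaves implicit: the names in reserved.values() still have to be declared in tokens for the lexer to accept them. A common way to do that looks like this (ID and NUMBER stand in for whatever other tokens the grammar uses):

tokens = ['ID', 'NUMBER'] + list(reserved.values())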
Note: You should avoid writing individual rules for reserved words. For example, if you write rules like this,
t_FOR   = r'for'
t_PRINT = r'print'

those rules will be triggered for identifiers that include those words as a prefix, such as "forget" or "printed". This is probably not what you want.
tokenizing with PLY's 'lex' module
PLY uses variables starting with "t_" to indicate the token patterns. If the variable is a string then it is interpreted as a regular expression and the matched text becomes the token's value. If the variable is a function then its docstring contains the pattern and the function is called with the matched token. The function is free to modify the token or return a new token to be used in its place. If nothing is returned then the match is ignored. Usually the function only changes the "value" attribute, which is initially the matched text. In the following, the t_COUNT rule converts the value to an int.
import lex

tokens = ("SYMBOL", "COUNT")

# One- and two-letter element symbols, grouped by first letter.
t_SYMBOL = (
    r"C[laroudsemf]?|Os?|N[eaibdpos]?|S[icernbmg]?|P[drmtboau]?|"
    r"H[eofgas]?|A[lrsgutcm]|B[eraik]?|Dy|E[urs]|F[erm]?|G[aed]|"
    r"I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenaf]|T[icebmalh]|"
    r"U|V|W|Xe|Yb?|Z[nr]")

def t_COUNT(t):
    r"\d+"
    # Repeat counts become Python ints instead of strings.
    t.value = int(t.value)
    return t

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))

lex.lex()
lex.input("CH3COOH")
for tok in iter(lex.token, None):
    print repr(tok.type), repr(tok.value)
BTW, iter(f, sentinel) is a handy function. Each pass through the loop calls f() and yields its return value via the iterator protocol. When the returned value equals the sentinel value it raises StopIteration. In other words, I converted a sequence of repeated function calls into an iterator, letting me use a for loop.
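For example, this little self-contained demo (the fetch function is just a stand-in for lex.token) stops as soon as the sentinel comes back:

data = [1, 2, None, 3]
def fetch():
    return data.pop(0)

for x in iter(fetch, None):
    print x
# prints 1 then 2; the loop ends when fetch() returns the None sentinel,
# so the trailing 3 is never reached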
When I run the code I get the following:

'SYMBOL' 'C'
'SYMBOL' 'H'
'COUNT' 3
'SYMBOL' 'C'
'SYMBOL' 'O'
'SYMBOL' 'O'
'SYMBOL' 'H'
You can see that the count was properly converted to an integer.