The following example shows how lex.py is used to write a simple tokenizer.
# ------------------------------------------------------------
# calclex.py
#
# tokenizer for a simple expression evaluator for
# numbers and +,-,*,/
# ------------------------------------------------------------

import ply.lex as lex

# List of token names. This is always required
tokens = (
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
   'LPAREN',
   'RPAREN',
)

# Regular expression rules for simple tokens
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'

# A regular expression rule with some action code
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

# A string containing ignored characters (spaces and tabs)
t_ignore  = ' \t'

# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()
To use the lexer, you first need to feed it some input text using its input() method. After that, repeated calls to token() produce tokens. The following code shows how this works:
# Test it out
data = '''
3 + 4 * 10
  + -20 *2
'''

# Give the lexer some input
lexer.input(data)

# Tokenize
while True:
    tok = lexer.token()
    if not tok:
        break      # No more input
    print(tok)
When executed, the example will produce the following output:
$ python example.py
LexToken(NUMBER,3,2,1)
LexToken(PLUS,'+',2,3)
LexToken(NUMBER,4,2,5)
LexToken(TIMES,'*',2,7)
LexToken(NUMBER,10,2,10)
LexToken(PLUS,'+',3,14)
LexToken(MINUS,'-',3,16)
LexToken(NUMBER,20,3,18)
LexToken(TIMES,'*',3,20)
LexToken(NUMBER,2,3,21)
Lexers also support the iteration protocol, so you can write the above loop more concisely as:
for tok in lexer:
    print(tok)
The tokens returned by lexer.token() are instances of LexToken. This object has attributes tok.type, tok.value, tok.lineno, and tok.lexpos. The following code shows an example of accessing these attributes:
# Tokenize
while True:
    tok = lexer.token()
    if not tok:
        break      # No more input
    print(tok.type, tok.value, tok.lineno, tok.lexpos)
The tok.type and tok.value attributes contain the type and value of the token itself. tok.lineno and tok.lexpos contain information about the location of the token: tok.lineno is the line number (as maintained by the t_newline rule above), and tok.lexpos is the index of the token relative to the start of the input text.
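Because tok.lexpos is an absolute offset into the input rather than a column number, the column has to be computed from it when needed. A minimal sketch, assuming the raw input string is still available (the helper name find_column is illustrative):

# Compute the column of a token (1-based).
#   input is the input text string
#   token is a token instance
def find_column(input, token):
    # Index of the first character on the token's line
    line_start = input.rfind('\n', 0, token.lexpos) + 1
    return (token.lexpos - line_start) + 1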
The tokens list
All lexers must provide a list tokens that defines all of the possible token names that can be produced by the lexer. This list is always required and is used to perform a variety of validation checks. The tokens list is also used by the yacc.py module to identify terminals.
In the example, the following code specifies the token names:
tokens = (
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
   'LPAREN',
   'RPAREN',
)
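To illustrate how yacc.py picks this list up, a parser module typically imports it straight from the lexer module. A minimal sketch, assuming the lexer above is saved as calclex.py:

# parser module (sketch only)
import ply.yacc as yacc

# Get the token map from the lexer; yacc.py uses these
# names to identify the grammar's terminal symbols.
from calclex import tokens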