Python PLY Programming: 5. Syntax Error Handling

When a syntax error occurs during parsing, the error is immediately detected (i.e., the parser does not read any more tokens beyond the source of the error). Error recovery in LR parsers is a delicate topic that involves ancient rituals and black-magic. The recovery mechanism provided by yacc.py is comparable to Unix yacc so you may want consult a book like O'Reilly's "Lex and Yacc" for some of the finer details.

When a syntax error occurs, yacc.py performs the following steps:

On the first occurrence of an error, the user-defined p_error() function is called with the offending token as an argument. Afterwards, the parser enters an "error-recovery" mode in which it will not make future calls to p_error() until it has successfully shifted at least 3 tokens onto the parsing stack.
If no recovery action is taken in p_error(), the offending lookahead token is replaced with a special error token.
If the offending lookahead token is already set to error, the top item of the parsing stack is deleted.
If the entire parsing stack is unwound, the parser enters a restart state and attempts to start parsing from its initial state.
If a grammar rule accepts error as a token, it will be shifted onto the parsing stack.
If the top item of the parsing stack is error, lookahead tokens will be discarded until the parser can successfully shift a new symbol or reduce a rule involving error.

Recovery and resynchronization with error rules

The most well-behaved approach for handling syntax errors is to write grammar rules that include the error token. For example, suppose your language had a grammar rule for a print statement like this:

def p_statement_print(p):
     'statement : PRINT expr SEMI'
     ...

To account for the possibility of a bad expression, you might write an additional grammar rule like this:

def p_statement_print_error(p):
     'statement : PRINT error SEMI'
     print "Syntax error in print statement. Bad expression"

In this case, the error token will match any sequence of tokens that might appear up to the first semicolon that is encountered. Once the semicolon is reached, the rule will be invoked and the error token will go away.

This type of recovery is sometimes known as parser resynchronization. The error token acts as a wildcard for any bad input text and the token immediately following error acts as a synchronization token.

It is important to note that the error token usually does not appear as the last token on the right in an error rule. For example:

def p_statement_print_error(p):
    'statement : PRINT error'
    print "Syntax error in print statement. Bad expression"

This is because the first bad token encountered will cause the rule to be reduced--which may make it difficult to recover if more bad tokens immediately follow.

Panic mode recovery

An alternative error recovery scheme is to enter a panic mode recovery in which tokens are discarded to a point where the parser might be able to recover in some sensible manner.

Panic mode recovery is implemented entirely in the p_error() function. For example, this function starts discarding tokens until it reaches a closing '}'. Then, it restarts the parser in its initial state.

def p_error(p):
    print "Whoa. You are seriously hosed."
    # Read ahead looking for a closing '}'
    while 1:
        tok = yacc.token()             # Get the next token
        if not tok or tok.type == 'RBRACE': break
    yacc.restart()

This function simply discards the bad token and tells the parser that the error was ok.

def p_error(p):
    print "Syntax error at token", p.type
    # Just discard the token and tell the parser it's okay.
    yacc.errok()

Within the p_error() function, three functions are available to control the behavior of the parser:

yacc.errok(). This resets the parser state so it doesn't think it's in error-recovery mode. This will prevent an error token from being generated and will reset the internal error counters so that the next syntax error will call p_error() again.
yacc.token(). This returns the next token on the input stream.
yacc.restart(). This discards the entire parsing stack and resets the parser to its initial state.

Note: these functions are only available when invoking p_error() and are not available at any other time.

To supply the next lookahead token to the parser, p_error() can return a token. This might be useful if trying to synchronize on special characters. For example:

def p_error(p):
    # Read ahead looking for a terminating ";"
    while 1:
        tok = yacc.token()             # Get the next token
        if not tok or tok.type == 'SEMI': break
    yacc.errok()

    # Return SEMI to the parser as the next lookahead token
    return tok

Python PLY Programming

Wednesday, 18 November 2015

5. Syntax Error Handling

Recovery and resynchronization with error rules

Panic mode recovery

No comments:

Post a Comment