basics of lexing and parsing

a simple introduction to lexing and parsing, and how to use them to lex, parse and evaluate simple mathematical expressions.

2024-02-11 (1 year ago)

this guide will cover the basics of lexing and parsing, and how to use them to lex, parse and evaluate simple mathematical expressions. it is not meant to be a comprehensive guide, but rather a simple introduction to the topic.

before we start, let’s define what lexing and parsing are.

lexing is the process of taking a string of characters and turning it into a list of tokens. a token is a piece of text that has a meaning in the language. for example, in the expression 1 + 2, the tokens are 1, + and 2.

parsing is the process of taking a list of tokens and turning it into an abstract syntax tree (AST). an AST is a tree-like representation of the expression. for example, the AST for 1 + 2 would look like this:

   +
   / \
  1   2

now that we know what lexing and parsing are, let’s get started.

prerequisites

this guide assumes you have a good understanding of programming, datastructures and algorithms. it also assumes you have a basic understanding of python.

throughout this guide, I will be using this reader class to simplify reading and peeking characters and tokens:

from typing import Iterable, Callable, Iterator


class Reader[T]:
    def __init__(self, data: Iterable[T]) -> None:
        self.data = data
        self.index = 0

    def position(self) -> int:
        """Returns the current position of the reader."""
        return self.index

    def seek(self, index: int) -> None:
        """Sets the current position of the reader."""
        self.index = index

    def peek(self) -> T | None:
        """Returns the next value without advancing the reader."""
        if self.index >= len(self.data):
            return None

        return self.data[self.index]

    def next(self) -> T | None:
        """Returns the next value and advances the reader."""
        if self.index >= len(self.data):
            return None

        self.index += 1
        return self.data[self.index - 1]

    def skip(self, count: int = 1) -> None:
        """Advances the reader by the given amount."""
        self.index += count

    def expect(self, predicate: Callable[[T], bool]) -> bool:
        """Advances the reader if the next value matches the given predicate."""
        if (value := self.peek()) is not None and predicate(value):
            self.skip()
            return True

        return False

    def next_while(self, predicate: Callable[[T], bool]) -> Iterator[T]:
        """Returns the next values while the given predicate is true."""
        while (value := self.peek()) is not None and predicate(value):
            self.skip()
            yield value

    def skip_while(self, predicate: Callable[[T], bool]) -> None:
        """Advances the reader while the given predicate is true."""
        for _ in self.next_while(predicate):
            pass

the class does not do much, but it will make writing the lexer and parser easier, as we will see later.

lexing

the first step to lexing is defining our tokens. for math expressions we will need tokens to represent numbers, operators and parentheses.

we will be using a flag enum to represent token types:

from enum import Flag, auto

class TokenType(Flag):
    Number = auto()

    Plus = auto()
    Minus = auto()
    Multiply = auto()
    Divide = auto()
    Power = auto()
    Modulo = auto()

    LeftParen = auto()
    RightParen = auto()

    Operator = Plus | Minus | Multiply | Divide | Power | Modulo
    Paren = LeftParen | RightParen

if you are not familiar with flag enums, they are enums that can be combined with the bitwise operators | and &. for example, TokenType.Plus | TokenType.Minus is a valid token type, and TokenType.Plus & TokenType.Minus is not.

in this case, we are using flag enums to group tokens together. for example, TokenType.Operator is a group of all the operator tokens, and TokenType.Paren is a group of all the parentheses tokens.

to simplify things we will also represent operator precedence as fields in the enum:

# ...
class TokenType(Flag):
    # ...

    FirstPrecedence = Plus | Minus
    SecondPrecedence = Multiply | Divide | Modulo
    ThirdPrecedence = Power

lets also define a helper function to convert a string character to a token type:

class TokenType(Flag):
    # ...

    @staticmethod
    def from_char(char: str) -> TokenType.Operator | TokenType.Paren | None:
        if char == "+":
            return TokenType.Plus
        if char == "-":
            return TokenType.Minus
        if char == "*":
            return TokenType.Multiply
        if char == "/":
            return TokenType.Divide
        if char == "%":
            return TokenType.Modulo
        if char == "^":
            return TokenType.Power
        if char == "(":
            return TokenType.LeftParen
        if char == ")":
            return TokenType.RightParen

        return None

to wrap up tokens we will define a token class:

class Token:
    def __init__(self, type: TokenType, value: str | float) -> None:
        self.type = type
        self.value = value

our token class has just one job, to hold a token type and a value. for example, the token for the number 123 would have the type TokenType.Number and the value 123.

if this were a real lexer, we would also include the span (start and end position) of the token in the source string, but for simplicity we will not do that here.

now that we have our tokens and token types defined, we can start writing the lexer.

the job of the lexer is to take a string of characters and turn it into a list of tokens. since we are only dealing with numbers and a few symbols this is a simple task.

our lexer will take the source string and wrap it in a reader.

class Lexer:
    def __init__(self, src: str) -> None:
        self.reader = Reader(src)

from this point onward every codeblock will be a method of the lexer class.

lets breakdown what we need the lexer to parse:

• whitespace (we need to know what it is in order to ignore it)
• numbers
• operators (+, -, *, /, %, ^)
• control characters ((, ))

the first thing we need to do is to skip any whitespace. we can do this by using the skip_while method of the reader class.

the method will advance the reader until it finds a non-whitespace character.

# method skips whitespace so it should return None
def whitespace(self) -> None:
    self.reader.skip_while(lambda char: char.isspace())

next, we need to parse numbers. for this demonstration, we will do it in a lazy way. (this method is explained in the comments)

# method parses numbers so it should return a Token or None
def number(self) -> Token | None:
    # lets keep track of the initial position of the reader
    position = self.reader.position()

    # this variable will hold onto the characters that make
    # up the token we are trying to parse
    lexeme = ""

    # helper function to check if a char makes up a number
    is_num = lambda char: char.isdigit() or char == "." or char == "-"

    # we will use the next_while method to advance the reader
    # and collect the characters that make up the token
    # (the reader is being advanced by the next_while method)
    for char in self.reader.next_while(is_num):
        # while the character can be part of a number
        # we will add it to the lexeme
        lexeme += char

    try:
        number = float(lexeme)
    except ValueError:
        # if the lexeme is not a valid number
        # we will reset the reader to the initial position
        # and return None
        self.reader.seek(position)
        return None

    # if the lexeme is a valid number
    # we will return a token with the number value
    return Token(TokenType.Number, number)

the reason we seek the reader back to the initial position if the lexeme is not a valid number is because we want to leave the reader in the same position it was before we tried to parse the number. this way, we can try to parse a different token.

the next tokens we need to parse are the operators and parentheses. we can do both in the same method.

# method parses operators and parentheses so it should return a Token or None
def operator_or_control(self) -> Token | None:
    # peek ahead to the next character (without advancing the reader)
    char = self.reader.peek()

    # get the token type for the character
    tt = TokenType.from_char(char)

    # if the character is not a valid token type
    # we will return None
    if tt is None:
        # because we peeked ahead, the reader is still
        # at the same position, so we don't need to seek
        return None

    # if the character is a valid token type
    # we will advance the reader and return a token
    self.reader.skip()

    return Token(tt, char)

we can now combine our parsing methods into a single method that tries to parse a single token.

# try to lex and return a single token
def lex_one(self) -> Token | None:
    # skip any whitespace
    self.whitespace()

    # try to parse a number
    if (token := self.number()) is not None:
        return token

    # try to parse an operator or control character
    if (token := self.operator_or_control()) is not None:
        return token

    # if we couldn't parse any token, return None
    return None

this can be more elegantly written as

def lex_one(self) -> Token | None:
    self.whitespace()
    return self.number() or self.operator_or_control()

now that we have a method to parse a single token, we can write a method to parse all the tokens in the source string.

# try to lex and return all tokens
def lex(self) -> list[Token]:
    tokens = []

    # while we can parse a token, add it to the list
    while token := self.lex_one():
        tokens.append(token)

    return tokens

that’s it for the lexer! we can now use it to lex a source string into a list of tokens.

lexer = Lexer("1 + 2 * (3 - 4)")
tokens = lexer.lex()

for token in tokens:
    print(f"{token.type.name}: {token.value}")

Number: 1.0
Plus: +
Number: 2.0
Multiply: *
LeftParen: (
Number: 3.0
Minus: -
Number: 4.0
RightParen: )

parsing

now that we have a working lexer and a list of tokens, we can start writing the parser.

parsing is a little more complex than lexing, our parser needs to take the list of tokens and turn it into an abstract syntax tree (AST).

we will be building a simple recursive descent parser. a recursive descent parser is a type of top-down parser that works by recursively calling methods to parse the input.

our AST will be a simple tree-like structure, where each node holds onto children nodes and a value(s).

lets define the base node class for our AST:

class Node:
    # since we are only dealing with numbers and operators
    # every node should evaluate to a number
    def eval(self) -> float:
        # this method will be implemented by the subclasses
        raise NotImplementedError

lets think about what potential nodes our tree could have:

• a number node
• an operator node

thats it! we can represent any mathematical expression with just these two nodes.

our number node can remain as is:

class Number(Node):
    def __init__(self, value: float) -> None:
        self.value = value

    # the number node just returns its value
    def eval(self) -> float:
        return self.value

however our operator node is an node that represents a binary operation on two children nodes. so lets define it as such:

class BinaryOperation(Node):
    def __init__(self, left: Node, right: Node, op: TokenType) -> None:
        self.left = left
        self.right = right
        self.op = op

    def eval(self) -> float:
        # we will come back to this method later
        pass

now that we have our nodes defined, we can begin writing the parser.

class Parser:
    def __init__(self, tokens: list[Token]) -> None:
        # once again we will wrap the tokens in a reader
        self.reader = Reader(tokens)

from this point onward every codeblock will be a method of the parser class.

lets breakdown what we need the parser to parse:

• individual numbers
• entering and exiting parentheses
• entire expressions
• addition and subtraction
• multiplication, division and modulo
• exponentiation

individual numbers, entering and exiting parentheses and entire expressions can fall under the same method. we can call this method parse_atom because it will parse the smallest unit of an expression.

# parse atom should always return a Node
def parse_atom(self) -> Node:
    # if we are expecting an atom but we reach
    # the end of the input, we will raise an exception
    if (token := self.reader.next()) is None:
        raise Exception("Unexpected end of input")

    match token.type:
        case TokenType.Number:
            # if the token is a number, we will return a number node
            return Number(token.value)

        case TokenType.LeftParen:
            # if the token is a left parenthesis, we will parse an expression
            node = self.parse_expression()
            # and then expect a right parenthesis
            if not self.reader.expect(lambda token: token.type == TokenType.RightParen):
                raise Exception("Expected closing parenthesis")

            # and then return the parsed expression
            return node

        case _:
            # if the token is not a number or a left parenthesis
            # we will raise an exception
            raise Exception(f"Unexpected token: {token.type.name}")

inside parse_atom we are using the parse_expression method which haven’t defined yet. we will come back to it later.

the next thing we will parse is exponentiation. like every other operator, exponentiation is a binary operation.

def parse_power(self) -> Node:
    # we will start by parsing the left operand
    lhs = self.parse_atom()

    # helper function to check if a token is an exponentiation operator
    third_prec = lambda token: token.type in TokenType.ThirdPrecedence

    # while the next token is an exponentiation operator
    for token in self.reader.next_while(third_prec):
        # we will parse the right operand
        rhs = self.parse_atom()

        # and then create a binary operation node
        # this node will be the new left operand for the next iteration
        # it contains the previous left operand
        lhs = BinaryOperation(lhs, rhs, token.type)

    # once we are done parsing all the exponentiation operators
    # we will return the left operand which will now be the
    # root of the exponentiation subtree
    return lhs

next we will parse multiplication, division and modulo. this is similar to parsing exponentiation, but with a different precedence.

def parse_mul_div_mod(self) -> Node:
    # we will start by parsing the left operand
    # we will use the parse_power method to do this because
    # exponentiation has a higher precedence than multiplication, division and modulo
    lhs = self.parse_power()

    # helper function to check if a token is a multiplication, division or modulo operator
    second_prec = lambda token: token.type in TokenType.SecondPrecedence

    # while the next token is in the second precedence group
    for token in self.reader.next_while(second_prec):
        # we will parse the right operand
        rhs = self.parse_power()
        lhs = BinaryOperation(lhs, rhs, token.type)

    # we will return the left operand which will now be the
    # root of the multiplication, division and modulo subtree
    return lhs

the reason we are using the parse_power method to parse the left operand is because exponentiation has a higher precedence than multiplication, division and modulo. this is the beauty of recursive descent parsing, by recursively calling higher precedence methods we can parse expressions with different precedence levels, and preserve the correct order of operations.

the next thing we will parse is addition and subtraction.

def parse_add_sub(self) -> Node:
    # we will start by parsing the left operand
    # once again we call a higher precedence method to do this
    lhs = self.parse_mul_div_mod()

    # helper function to check if a token is in the first precedence group
    first_prec = lambda token: token.type in TokenType.FirstPrecedence

    # while the next token is an addition or subtraction operator
    for token in self.reader.next_while(first_prec):
        rhs = self.parse_mul_div_mod()
        lhs = BinaryOperation(lhs, rhs, token.type)

    return lhs

finally we can define the parse_expression method which will parse an entire expression. (this method is just a wrapper around parse_add_sub for simplicity)

def parse_expression(self) -> Node:
    return self.parse_add_sub()

def parse(self) -> Node:
    return self.parse_expression()

if you trace the calls from the parse_expression method to the parse_atom method, you will see that we are parsing the entire expression from the bottom up. we start by parsing the smallest unit of an expression, and then build up the expression by parsing larger and larger units. this is the essence of recursive descent parsing, we are recursively calling methods to parse the input and building up the AST as we go.

now that we have our parser defined, we can use it to parse a list of tokens into an AST.

lexer = Lexer("1 + 2 * (3 - 4)")
tokens = lexer.lex()
parser = Parser(tokens)

ast = parser.parse()

lets jump back to the eval method of the BinaryOperation class and implement it.

class BinaryOperation(Node):
    # ...

    def eval(self) -> float:
        # we will start by recursively evaluating the left operand
        left = self.left.eval()

        # then we will by recursively  evaluate the right operand
        right = self.right.eval()

        # and then apply the operator to the left and right operands
        match self.op:
            case TokenType.Plus:
                return left + right
            case TokenType.Minus:
                return left - right
            case TokenType.Multiply:
                return left * right
            case TokenType.Divide:
                return left / right
            case TokenType.Power:
                return left ** right
            case TokenType.Modulo:
                return left % right

it’s really that simple! we traverse down the AST and evaluate the left and right operands and then apply the operator to the results, traversing back up the AST.

lets test our parser by evaluating the AST.

lexer = Lexer("1 + 2 * (3 - 4)")
tokens = lexer.lex()
parser = Parser(tokens)

ast = parser.parse()

print(ast.eval())

-1.0

you don’t have to be shy with the expressions, try to evaluate

1 + 2 * -3 ^ 4 % 5 - (6 + (-2 + 2)) * 8 + 9 / 10 * 11 ^ 12 % 13

you should get

-44.10009765625

and that’s it! we have successfully lexed, parsed and evaluated a simple mathematical expression.

takeaways

• lexing is the process of turning a string of characters into a list of tokens
• parsing is the process of turning a list of tokens into an abstract syntax tree (AST)
• recursive descent parsing is a type of top-down parsing that works by recursively calling methods to parse the input

homework

• add support for unary operators (e.g. -1)
• add support for simple functions (e.g. sin(1))
• add support for constant variables (e.g. PI + 1)

find me somewhere else on the web

github linkedin email