basics of lexing and parsing
a simple introduction to lexing and parsing, and how to use them to lex, parse and evaluate simple mathematical expressions.
2024-02-11 (9 months ago)this guide will cover the basics of lexing and parsing, and how to use them to lex, parse and evaluate simple mathematical expressions. it is not meant to be a comprehensive guide, but rather a simple introduction to the topic.
before we start, let’s define what lexing and parsing are.
lexing is the process of taking a string of characters and turning it into a list of tokens. a token is a piece of text that has a meaning in the language. for example, in the expression 1 + 2
, the tokens are 1
, +
and 2
.
parsing is the process of taking a list of tokens and turning it into an abstract syntax tree (AST). an AST is a tree-like representation of the expression. for example, the AST for 1 + 2
would look like this:
+
/ \
1 2
now that we know what lexing and parsing are, let’s get started.
prerequisites
this guide assumes you have a good understanding of programming, datastructures and algorithms. it also assumes you have a basic understanding of python.
throughout this guide, I will be using this reader class to simplify reading and peeking characters and tokens:
from typing import Iterable, Callable, Iterator
class Reader[T]:
def __init__(self, data: Iterable[T]) -> None:
self.data = data
self.index = 0
def position(self) -> int:
"""Returns the current position of the reader."""
return self.index
def seek(self, index: int) -> None:
"""Sets the current position of the reader."""
self.index = index
def peek(self) -> T | None:
"""Returns the next value without advancing the reader."""
if self.index >= len(self.data):
return None
return self.data[self.index]
def next(self) -> T | None:
"""Returns the next value and advances the reader."""
if self.index >= len(self.data):
return None
self.index += 1
return self.data[self.index - 1]
def skip(self, count: int = 1) -> None:
"""Advances the reader by the given amount."""
self.index += count
def expect(self, predicate: Callable[[T], bool]) -> bool:
"""Advances the reader if the next value matches the given predicate."""
if (value := self.peek()) is not None and predicate(value):
self.skip()
return True
return False
def next_while(self, predicate: Callable[[T], bool]) -> Iterator[T]:
"""Returns the next values while the given predicate is true."""
while (value := self.peek()) is not None and predicate(value):
self.skip()
yield value
def skip_while(self, predicate: Callable[[T], bool]) -> None:
"""Advances the reader while the given predicate is true."""
for _ in self.next_while(predicate):
pass
the class does not do much, but it will make writing the lexer and parser easier, as we will see later.
lexing
the first step to lexing is defining our tokens. for math expressions we will need tokens to represent numbers, operators and parentheses.
we will be using a flag enum to represent token types:
from enum import Flag, auto
class TokenType(Flag):
Number = auto()
Plus = auto()
Minus = auto()
Multiply = auto()
Divide = auto()
Power = auto()
Modulo = auto()
LeftParen = auto()
RightParen = auto()
Operator = Plus | Minus | Multiply | Divide | Power | Modulo
Paren = LeftParen | RightParen
if you are not familiar with flag enums, they are enums that can be combined with the bitwise operators |
and &
. for example, TokenType.Plus | TokenType.Minus
is a valid token type, and TokenType.Plus & TokenType.Minus
is not.
in this case, we are using flag enums to group tokens together. for example, TokenType.Operator
is a group of all the operator tokens, and TokenType.Paren
is a group of all the parentheses tokens.
to simplify things we will also represent operator precedence as fields in the enum:
# ...
class TokenType(Flag):
# ...
FirstPrecedence = Plus | Minus
SecondPrecedence = Multiply | Divide | Modulo
ThirdPrecedence = Power
lets also define a helper function to convert a string character to a token type:
class TokenType(Flag):
# ...
@staticmethod
def from_char(char: str) -> TokenType.Operator | TokenType.Paren | None:
if char == "+":
return TokenType.Plus
if char == "-":
return TokenType.Minus
if char == "*":
return TokenType.Multiply
if char == "/":
return TokenType.Divide
if char == "%":
return TokenType.Modulo
if char == "^":
return TokenType.Power
if char == "(":
return TokenType.LeftParen
if char == ")":
return TokenType.RightParen
return None
to wrap up tokens we will define a token class:
class Token:
def __init__(self, type: TokenType, value: str | float) -> None:
self.type = type
self.value = value
our token class has just one job, to hold a token type and a value. for example, the token for the number 123
would have the type TokenType.Number
and the value 123
.
if this were a real lexer, we would also include the span (start and end position) of the token in the source string, but for simplicity we will not do that here.
now that we have our tokens and token types defined, we can start writing the lexer.
the job of the lexer is to take a string of characters and turn it into a list of tokens. since we are only dealing with numbers and a few symbols this is a simple task.
our lexer will take the source string and wrap it in a reader.
class Lexer:
def __init__(self, src: str) -> None:
self.reader = Reader(src)
from this point onward every codeblock will be a method of the lexer class.
lets breakdown what we need the lexer to parse:
- • whitespace (we need to know what it is in order to ignore it)
- • numbers
- • operators (
+
,-
,*
,/
,%
,^
) - • control characters (
(
,)
)
the first thing we need to do is to skip any whitespace. we can do this by using the skip_while
method of the reader class.
the method will advance the reader until it finds a non-whitespace character.
# method skips whitespace so it should return None
def whitespace(self) -> None:
self.reader.skip_while(lambda char: char.isspace())
next, we need to parse numbers. for this demonstration, we will do it in a lazy way. (this method is explained in the comments)
# method parses numbers so it should return a Token or None
def number(self) -> Token | None:
# lets keep track of the initial position of the reader
position = self.reader.position()
# this variable will hold onto the characters that make
# up the token we are trying to parse
lexeme = ""
# helper function to check if a char makes up a number
is_num = lambda char: char.isdigit() or char == "." or char == "-"
# we will use the next_while method to advance the reader
# and collect the characters that make up the token
# (the reader is being advanced by the next_while method)
for char in self.reader.next_while(is_num):
# while the character can be part of a number
# we will add it to the lexeme
lexeme += char
try:
number = float(lexeme)
except ValueError:
# if the lexeme is not a valid number
# we will reset the reader to the initial position
# and return None
self.reader.seek(position)
return None
# if the lexeme is a valid number
# we will return a token with the number value
return Token(TokenType.Number, number)
the reason we seek the reader back to the initial position if the lexeme is not a valid number is because we want to leave the reader in the same position it was before we tried to parse the number. this way, we can try to parse a different token.
the next tokens we need to parse are the operators and parentheses. we can do both in the same method.
# method parses operators and parentheses so it should return a Token or None
def operator_or_control(self) -> Token | None:
# peek ahead to the next character (without advancing the reader)
char = self.reader.peek()
# get the token type for the character
tt = TokenType.from_char(char)
# if the character is not a valid token type
# we will return None
if tt is None:
# because we peeked ahead, the reader is still
# at the same position, so we don't need to seek
return None
# if the character is a valid token type
# we will advance the reader and return a token
self.reader.skip()
return Token(tt, char)
we can now combine our parsing methods into a single method that tries to parse a single token.
# try to lex and return a single token
def lex_one(self) -> Token | None:
# skip any whitespace
self.whitespace()
# try to parse a number
if (token := self.number()) is not None:
return token
# try to parse an operator or control character
if (token := self.operator_or_control()) is not None:
return token
# if we couldn't parse any token, return None
return None
this can be more elegantly written as
def lex_one(self) -> Token | None:
self.whitespace()
return self.number() or self.operator_or_control()
now that we have a method to parse a single token, we can write a method to parse all the tokens in the source string.
# try to lex and return all tokens
def lex(self) -> list[Token]:
tokens = []
# while we can parse a token, add it to the list
while token := self.lex_one():
tokens.append(token)
return tokens
that’s it for the lexer! we can now use it to lex a source string into a list of tokens.
lexer = Lexer("1 + 2 * (3 - 4)")
tokens = lexer.lex()
for token in tokens:
print(f"{token.type.name}: {token.value}")
Number: 1.0
Plus: +
Number: 2.0
Multiply: *
LeftParen: (
Number: 3.0
Minus: -
Number: 4.0
RightParen: )
parsing
now that we have a working lexer and a list of tokens, we can start writing the parser.
parsing is a little more complex than lexing, our parser needs to take the list of tokens and turn it into an abstract syntax tree (AST).
we will be building a simple recursive descent parser. a recursive descent parser is a type of top-down parser that works by recursively calling methods to parse the input.
our AST will be a simple tree-like structure, where each node holds onto children nodes and a value(s).
lets define the base node class for our AST:
class Node:
# since we are only dealing with numbers and operators
# every node should evaluate to a number
def eval(self) -> float:
# this method will be implemented by the subclasses
raise NotImplementedError
lets think about what potential nodes our tree could have:
- • a number node
- • an operator node
thats it! we can represent any mathematical expression with just these two nodes.
our number node can remain as is:
class Number(Node):
def __init__(self, value: float) -> None:
self.value = value
# the number node just returns its value
def eval(self) -> float:
return self.value
however our operator node is an node that represents a binary operation on two children nodes. so lets define it as such:
class BinaryOperation(Node):
def __init__(self, left: Node, right: Node, op: TokenType) -> None:
self.left = left
self.right = right
self.op = op
def eval(self) -> float:
# we will come back to this method later
pass
now that we have our nodes defined, we can begin writing the parser.
class Parser:
def __init__(self, tokens: list[Token]) -> None:
# once again we will wrap the tokens in a reader
self.reader = Reader(tokens)
from this point onward every codeblock will be a method of the parser class.
lets breakdown what we need the parser to parse:
- • individual numbers
- • entering and exiting parentheses
- • entire expressions
- • addition and subtraction
- • multiplication, division and modulo
- • exponentiation
individual numbers, entering and exiting parentheses and entire expressions can fall under the same method. we can call this method parse_atom
because it will parse the smallest unit of an expression.
# parse atom should always return a Node
def parse_atom(self) -> Node:
# if we are expecting an atom but we reach
# the end of the input, we will raise an exception
if (token := self.reader.next()) is None:
raise Exception("Unexpected end of input")
match token.type:
case TokenType.Number:
# if the token is a number, we will return a number node
return Number(token.value)
case TokenType.LeftParen:
# if the token is a left parenthesis, we will parse an expression
node = self.parse_expression()
# and then expect a right parenthesis
if not self.reader.expect(lambda token: token.type == TokenType.RightParen):
raise Exception("Expected closing parenthesis")
# and then return the parsed expression
return node
case _:
# if the token is not a number or a left parenthesis
# we will raise an exception
raise Exception(f"Unexpected token: {token.type.name}")
inside parse_atom
we are using the parse_expression
method which haven’t defined yet. we will come back to it later.
the next thing we will parse is exponentiation. like every other operator, exponentiation is a binary operation.
def parse_power(self) -> Node:
# we will start by parsing the left operand
lhs = self.parse_atom()
# helper function to check if a token is an exponentiation operator
third_prec = lambda token: token.type in TokenType.ThirdPrecedence
# while the next token is an exponentiation operator
for token in self.reader.next_while(third_prec):
# we will parse the right operand
rhs = self.parse_atom()
# and then create a binary operation node
# this node will be the new left operand for the next iteration
# it contains the previous left operand
lhs = BinaryOperation(lhs, rhs, token.type)
# once we are done parsing all the exponentiation operators
# we will return the left operand which will now be the
# root of the exponentiation subtree
return lhs
next we will parse multiplication, division and modulo. this is similar to parsing exponentiation, but with a different precedence.
def parse_mul_div_mod(self) -> Node:
# we will start by parsing the left operand
# we will use the parse_power method to do this because
# exponentiation has a higher precedence than multiplication, division and modulo
lhs = self.parse_power()
# helper function to check if a token is a multiplication, division or modulo operator
second_prec = lambda token: token.type in TokenType.SecondPrecedence
# while the next token is in the second precedence group
for token in self.reader.next_while(second_prec):
# we will parse the right operand
rhs = self.parse_power()
lhs = BinaryOperation(lhs, rhs, token.type)
# we will return the left operand which will now be the
# root of the multiplication, division and modulo subtree
return lhs
the reason we are using the parse_power
method to parse the left operand is because exponentiation has a higher precedence than multiplication, division and modulo. this is the beauty of recursive descent parsing, by
recursively calling higher precedence methods we can parse expressions with different precedence levels, and preserve the correct order of operations.
the next thing we will parse is addition and subtraction.
def parse_add_sub(self) -> Node:
# we will start by parsing the left operand
# once again we call a higher precedence method to do this
lhs = self.parse_mul_div_mod()
# helper function to check if a token is in the first precedence group
first_prec = lambda token: token.type in TokenType.FirstPrecedence
# while the next token is an addition or subtraction operator
for token in self.reader.next_while(first_prec):
rhs = self.parse_mul_div_mod()
lhs = BinaryOperation(lhs, rhs, token.type)
return lhs
finally we can define the parse_expression
method which will parse an entire expression. (this method is just a wrapper around parse_add_sub
for simplicity)
def parse_expression(self) -> Node:
return self.parse_add_sub()
def parse(self) -> Node:
return self.parse_expression()
if you trace the calls from the parse_expression
method to the parse_atom method, you will see that we are parsing the entire expression from the bottom up. we start by parsing the smallest unit of an expression, and then build up the expression by parsing larger and larger units. this is the essence of recursive descent parsing, we are recursively calling methods to parse the input and building up the AST as we go.
now that we have our parser defined, we can use it to parse a list of tokens into an AST.
lexer = Lexer("1 + 2 * (3 - 4)")
tokens = lexer.lex()
parser = Parser(tokens)
ast = parser.parse()
lets jump back to the eval
method of the BinaryOperation
class and implement it.
class BinaryOperation(Node):
# ...
def eval(self) -> float:
# we will start by recursively evaluating the left operand
left = self.left.eval()
# then we will by recursively evaluate the right operand
right = self.right.eval()
# and then apply the operator to the left and right operands
match self.op:
case TokenType.Plus:
return left + right
case TokenType.Minus:
return left - right
case TokenType.Multiply:
return left * right
case TokenType.Divide:
return left / right
case TokenType.Power:
return left ** right
case TokenType.Modulo:
return left % right
it’s really that simple! we traverse down the AST and evaluate the left and right operands and then apply the operator to the results, traversing back up the AST.
lets test our parser by evaluating the AST.
lexer = Lexer("1 + 2 * (3 - 4)")
tokens = lexer.lex()
parser = Parser(tokens)
ast = parser.parse()
print(ast.eval())
-1.0
you don’t have to be shy with the expressions, try to evaluate
1 + 2 * -3 ^ 4 % 5 - (6 + (-2 + 2)) * 8 + 9 / 10 * 11 ^ 12 % 13
you should get
-44.10009765625
and that’s it! we have successfully lexed, parsed and evaluated a simple mathematical expression.
takeaways
- • lexing is the process of turning a string of characters into a list of tokens
- • parsing is the process of turning a list of tokens into an abstract syntax tree (AST)
- • recursive descent parsing is a type of top-down parsing that works by recursively calling methods to parse the input
homework
- • add support for unary operators (e.g.
-1
) - • add support for simple functions (e.g.
sin(1)
) - • add support for constant variables (e.g.
PI + 1
)