../day3-ast

Published on: 2026-05-23

By Vishal Rashmika

Day 3: Representing Code: Grammars, BNF, and Abstract Syntax Trees

Welcome back, In Day 2, we built our lexical analyzer the component that transforms raw text into tokens. We went from a string of characters like x = 10 + y; to a neat sequence of labeled tokens: IDENTIFIER, EQUALS, NUMBER, PLUS, IDENTIFIER, SEMICOLON.

But here’s the thing: tokens are still just a flat list. They’re like words in a sentence. Just because you can read individual words doesn’t mean you understand the sentence’s structure or meaning. so, before we start building the parser we need to find a solution for this.

Today, we’re taking the next giant leap: learning how to represent the structure and meaning of code. This is where things get really interesting, because we’re going to discover that code isn’t just a sequence, it’s a tree.

If Day 2 was about recognizing the words, today is about understanding the grammar of the language that we are building.

The Problem: Flat Isn’t Enough

Imagine you have these flat list of tokens:

1 + 2 * 3

Which becomes:

NUMBER(1) PLUS NUMBER(2) STAR NUMBER(3)

Good, But what does this mean? More importantly, in what order should we evaluate it?

If you evaluate left to right, you get: (1 + 2) * 3 = 9
But if you remember your math class, you know it should be (BODMAS Rule/Order of Operations): 1 + (2 * 3) = 7

The multiplication has higher precedence, it should happen first. Our flat list of tokens doesn’t capture this structure. We need something that explicitly shows: “The multiplication is nested inside the addition.”

This is where syntax trees comes into play,

What is a Syntax Tree?

A syntax tree is exactly what it sounds like: a tree-shaped data structure that represents the grammatical structure of your code.

lets take 1 + 2 * 3 as an example, break it down and build a syntax tree:

       [+]
      /   \
    [1]   [*]
          /  \
        [2]  [3]

See how the multiplication is a subtree of the addition? This explicitly shows that we need to evaluate 2 * 3 first, then add the result to 1.

Why Trees Work So Well

Trees are perfect for representing code because:

They show nesting: Operations within operations, expressions within expressions
They enforce order: Child nodes must be evaluated before parent nodes
They’re recursive: An expression can contain other expressions, which can contain other expressions, and so on…
They’re unambiguous: There’s exactly one way to interpret the structure

But First: What Is a Grammar?

Before we can build trees, we need to understand grammars, the rules that define what valid code looks like.

The Grammar of Email Addresses

You know what a valid email address looks like, right? [email protected] is valid, but @[email protected] is gibberish. There are rules for what makes an email valid:

It must have a username part
Then an @ symbol
Then a domain name
The domain must have at least one dot
After the dot, there must be a domain extension

These are grammar rules. They define the structure of valid email addresses.

Programming languages have grammar rules too, but they’re much more complex than grammar rules for email addresses.

Differenr types of Grammars

Grammars exist on a spectrum from simple to complex.

Regular Grammars (What We Used for Scanning)

For our lexer used regular expressions and regular grammars. These are the simplest type, perfect for recognizing flat patterns.

A regular grammar can recognize things like:

Keywords: cast, enchant, manifest
Numbers: 123, 45.67
Identifiers: userName, dragon_age

But regular grammars have a limitation: they can’t count or match nested structures.

Think about matching parentheses: ((())). A regular grammar can’t ensure that the number of opening parens equals the number of closing parens. It can recognize individual parens, but it can’t count them.

This is a problem because code is full of nesting and nested structures

# We need to match these:
( 1 + ( 2 * ( 3 - 4 ) ) )
{ if (x) { while (y) { ... } } }

For this, we need something more powerful than regular grammars

Context-Free Grammars (CFGs)

Context-Free Grammars are one level up in power. They can handle nested structures, recursion, and keep counts.

A CFG consists of:

Terminals: The actual tokens from our lexer (like NUMBER, PLUS, IF)
Non-terminals: Placeholders for patterns (like expression, statement)
Production rules: Recipes that say how to build non-terminals from terminals and other non-terminals

Car Order Grammar

Let’s start with something fun and practicle before we dive into code. Here’s a grammar for describing a car order:

car    → trim "with" car "upgrades"
       | trim
       | engine

trim   → "sport" "package"
       | "luxury" "edition"
       | "offroad"

engine → "turbocharged" "inline four"
       | "twin turbo" "V6"
       | "electric" "motor"

Let me break this down:

car, trim, and engine are non-terminals (categories)
Things in quotes like "sport" and "package" are terminals (actual words)
The → means “can be made from”
The | means “or” (multiple choices)

Using these rules, we can generate valid car configurations:

Start with car, pick a rule:

car → trim "with" car "upgrades"

Now expand trim, choose:

trim → "luxury" "edition"

For the second car, choose:

car → engine

Then:

engine → "turbocharged" "inline four"

Final result: "luxury" "edition" "with" "turbocharged" "inline four" "upgrades"

That’s a valid car configuration, The grammar generated it by following the rules.

why did we choose the car twice ?

We choose car twice because of a powerful concept in computer science called recursion. It’s the secret trick that allows a short list of grammar rules to handle infinitely long descriptions.

Look at that first rule again:

car → trim "with" car "upgrades"

Notice, how car appears on the left side (the definition) and also on the right side (the ingredients). Because car is inside its own definition, you can keep nesting upgrades for as long as you want. How it looks in action

If you wanted a car with multiple upgrades, you could chain them together by choosing that first rule over and over again:

First choice: trim “with” car “upgrades”
Second choice (expand the inner car): trim “with” (trim “with” car “upgrades”) “upgrades”
Third choice (expand the inner car again): trim “with” (trim “with” (engine) “upgrades”) “upgrades”

Without that second car, we would only ever be allowed to pick exactly one trim or one engine, and our grammar would stop right there. Including car twice acts like a loop, letting us stack as many customizable options onto our vehicle as our budget (or the grammar) allows.

BNF: A Notation for Writing Grammars

When computer scientists write grammars, they use a notation called Backus-Naur Form (BNF). It’s just a standardized way to write production rules.

Here’s the BNF notation we’ll use:

rule_name  → symbols go here ;
          | or another option ;

Plus some convenient shortcuts:

* means “zero or more times”
+ means “one or more times”
? means “optional (zero or one time)”
( ) for grouping
| for alternatives

So instead of writing:

horsepower → "super"
| "super" "super"
| "super" "super" "super"
| ...

We can write:

horsepower → "super"+ ;

Much cleaner

A Grammar for Math Expressions

Now let’s apply this to actual code. Here’s a simple grammar for arithmetic expressions:

expression → literal
           | unary
           | binary
           | grouping ;

literal    → NUMBER | STRING | "true" | "false" | "nil" ;

grouping   → "(" expression ")" ;

unary      → ( "-" | "!" ) expression ;

binary     → expression operator expression ;

operator   → "==" | "!=" | "<" | "<=" | ">" | ">="
           | "+"  | "-"  | "*" | "/" ;

Let’s trace through how this generates 1 + 2 * 3:

Start with expression
Choose binary
binary → expression operator expression
First expression → literal → NUMBER → 1
operator → +
Second expression → binary (recursion!)
That binary → expression operator expression
→ NUMBER * NUMBER
→ 2 * 3

Final: 1 + 2 * 3

Notice how the grammar naturally creates a tree structure through recursion.

From Grammar to Trees : Abstract Syntax Trees (ASTs)

When we parse code using a grammar, we build a tree. But not just any tree, an Abstract Syntax Tree (AST).

What Makes It “Abstract”?

A concrete syntax tree (or parse tree) includes EVERY detail from the grammar, including punctuation, keywords, and grouping symbols.

An abstract syntax tree strips away the syntactic noise and keeps only what matters for meaning.

For example, parsing (1 + 2) with parentheses:

Concrete syntax tree:

     grouping
    /    |    \
   (  expression )
        |
      binary
     /   |   \
    1    +    2

Abstract syntax tree:

    +
   / \
  1   2

The parentheses did their job (enforcing grouping), so we don’t need them in the AST. The tree structure itself shows the grouping.

AST Example: Breaking Down `1 + 2 * 3`

Let’s build the complete AST for 1 + 2 * 3:

        Binary (+)
        /        \
    Literal(1)  Binary (*)
                /        \
            Literal(2)  Literal(3)

Each node is an expression type:

Binary expression at the root: the addition operation
- Left child: a Literal expression (the number 1)
- Right child: another Binary expression (the multiplication)
The multiplication Binary expression:
- Left child: Literal expression (the number 2)
- Right child: Literal expression (the number 3)

To evaluate this tree, we work from the bottom up (post-order traversal):

Evaluate Literal(2) → 2
Evaluate Literal(3) → 3
Evaluate Binary(*) with 2 and 3 → 6
Evaluate Literal(1) → 1
Evaluate Binary(+) with 1 and 6 → 7

The tree structure automatically enforces the correct order of operations.

Real World Analogy: ASTs Are Like Organizational Charts

Think of an AST like a company org chart:

           CEO
          /   \
    Marketing  Engineering
       |       /     |     \
    Social   Backend Frontend Mobile

The CEO is the “root” (top-level node)
Departments are “child nodes”
Teams are “leaf nodes”

To get work done, you start at the leaves (teams do actual work) and results flow upward. Same with ASTs: leaf nodes (literals) provide values, operators combine them, and results flow up to the root.

Implementing the AST: From Theory to Code

Now comes the fun part: actually designing the architecture to represent these trees in C++.

The Challenge: Many Expression Types

We need classes for each type of expression:

Binary: 1 + 2, x * y, a == b
Unary: -5, !true
Literal: 42, "hello", true
Grouping: (1 + 2) Each type has different data:
Binary has TWO operands (left and right) and an operator
Unary has ONE operand and an operator
Literal has just a value
Grouping has one sub-expression We could make one giant class with fields for everything, but that’s messy. Instead, we’ll use inheritance with a base Expr class and derived classes for each type.

The Expression Class Hierarchy Architecture

In C++, we structure this using:

Base Class: An abstract Expr class that defines the interface all expressions must follow

Derived Classes: Concrete classes like Binary, Unary, Literal, and Grouping that inherit from Expr

Each derived class is simple—just a container for its specific data. No methods, no logic. Just pure data structures with:

Member variables to hold the expression’s data
A constructor to initialize those variables
That’s it. For example:
Binary holds: a left expression pointer, an operator token, and a right expression pointer
Unary holds: an operator token and a right expression pointer
Literal holds: a value (could be a number, string, boolean, or nil)
Grouping holds: a single sub-expression pointer “Wait,” you might be thinking, “where’s all the behavior? Shouldn’t these classes DO something?”

This brings us to a fundamental problem in programming…

The Expression Problem: OOP vs Functional Programming

Here’s the dilemma: We have multiple types (Binary, Unary, Literal, Grouping) and we’ll need multiple operations on them (evaluate, print, analyze, type-check, etc.).

How do we organize this code?

The OOP Way

In object-oriented programming, you’d add methods to each class:

Each expression class would have methods like:

evaluate() - to calculate the result
print() - to display the tree
analyze() - for type checking
And so on… Pros: Easy to add new expression types just create a new class
Cons: Hard to add new operations you must modify ALL existing classes

Imagine you have 21 expression classes. Adding one new operation means opening all 21 files and adding a method to each. That’s a maintenance nightmare.

The Functional Way: Pattern Matching

In functional languages, you’d write separate functions that pattern match on the expression type:

Each operation is one function that handles all types:

One evaluate() function with cases for Binary, Unary, Literal, etc.
One print() function with cases for all types
One analyze() function with cases for all types Pros: Easy to add new operations—just write a new function
Cons: Hard to add new types—you must modify ALL existing functions

If you add a new expression type, you must update every single operation function. With dozens of operations, that’s equally painful.

The Problem

Neither approach makes it easy to add BOTH new types AND new operations. This is called the Expression Problem.

For our interpreter, we’ll face this head-on: we have a fixed set of expression types, but we’ll constantly add new operations (parsing, evaluating, type checking, optimizing, etc…)

We need a solution that lets us add operations easily without touching the expression classes.

The Visitor Pattern

The Visitor Pattern solves the expression problem (in OOP languages) by adding a layer of indirection. It lets us define new operations on our expression types without modifying the classes themselves.

Here’s the key insight: Instead of putting operation methods IN the expression classes, we pass the expressions TO separate operation classes.

The Pattern Architecture in C++

The Visitor pattern in C++ involves a clever dance between two hierarchies:

1. The Expression Hierarchy (what we’re operating on):

Base Expr class
Derived classes: Binary, Unary, Literal, Grouping 2. The Visitor Hierarchy (the operations):
Base Visitor interface (template class)
Derived visitor classes: Evaluator, Printer, TypeChecker, etc.

How The Pieces Fit Together

The Visitor Interface defines a pure virtual method for each expression type:

visitBinaryExpr() - handles Binary expressions
visitUnaryExpr() - handles Unary expressions
visitLiteralExpr() - handles Literal expressions
visitGroupingExpr() - handles Grouping expressions The visitor is templated on a return type R, so different operations can return different types:
The printer returns strings
The evaluator returns values (numbers, booleans, etc.)
The type checker returns type information The Expression Classes each have ONE method: accept()
This method takes a visitor as a parameter
It calls the appropriate visit method on that visitor
It passes itself (this) to the visitor The Concrete Visitors implement the interface:
Each one is a complete operation
It has specific logic for handling each expression type
All the code for one operation lives in one class

The Double Dispatch Magic

Here’s where it gets clever. When you want to perform an operation on an expression, this happens:

First Dispatch: You call expression->accept(visitor)
- C++ polymorphism routes this to the correct derived class’s accept() method
- A Binary expression’s accept runs, or a Unary’s, based on the actual type
Second Dispatch: That accept() method calls back to the visitor
- Binary’s accept calls visitor->visitBinaryExpr(this)
- Unary’s accept calls visitor->visitUnaryExpr(this)
- The visitor now has the specific expression type Two dispatches, hence “double dispatch.” The first finds the right expression class, the second finds the right operation method.

Real World Analogy: Airport Security Screening

Think of airport security with different types of travelers and different screening procedures:

Travelers (Expressions):

Regular passengers
Crew members
Passengers with special needs
First-class passengers Screening Procedures (Visitors):
Security check visitor
Baggage handling visitor
Boarding pass verification visitor When you (a traveler) go through security (accept a visitor):

The system identifies what TYPE of traveler you are (first dispatch)
Based on your type, you’re routed to the appropriate screening procedure (second dispatch)
The screening procedure knows exactly how to handle YOUR type of traveler Different screening procedures (visitors) can be added without changing traveler types. If tomorrow they add a “health screening visitor,” all traveler types automatically support it through their existing accept() method.

Why This Solves Our Problem

Remember the table from earlier? Rows are expression types, columns are operations.

Without visitors, adding a column (new operation) means:

Open 21 expression class files
Add a new method to each
Risk breaking existing code With visitors, adding a column (new operation) means:
Create ONE new visitor class
Implement the visit methods for each type
Expression classes remain untouched We’ve essentially flipped the problem on its head. Now adding operations is easy.

The Trade-Off

The visitor pattern isn’t free:

What we gain:

Easy to add new operations
All logic for one operation in one place
Clean separation of concerns
Type-safe dispatch to the right handler

What we give up:

More complex initial setup
Adding new expression types requires updating all visitors
Extra indirection can be harder to follow at first For a compiler/interpreter, this trade-off is perfect. Our expression types are relatively stable (they’re defined by the language grammar), but we constantly add new operations (parsing, evaluating, optimizing, etc.).

Meta-Programming

Writing all those expression classes by hand is tedious. For Binary, Unary, Literal, Grouping, that’s manageable. But by the time we’re done with our language, we’ll have 21+ expression types.

Each one follows the exact same pattern:

A class declaration
Member variables for its data
A constructor that initializes those variables
An accept() method that calls the right visitor method
Proper memory management (destructors in C++) Rather than type all that boilerplate repeatedly (and risk typos or inconsistencies), we’re going to use code generation—writing a program that writes programs.

The Generator Architecture

The idea is simple: create a small utility program that:

Takes a description of each expression type (name and fields)
Generates the C++ code for all the classes
Outputs the complete header file ready to use The description format is straightforward—for each expression type, we specify:

The class name (like “Binary” or “Unary”)
The list of fields with their types (like “left expression, operator token, right expression”)

What Gets Generated

For each expression type described, the generator creates:

Class Declaration: The derived class that inherits from Expr

Member Variables: Private or public members for all the data

Smart pointers for sub-expressions (using std::unique_ptr or std::shared_ptr)
Tokens for operators
Values for literals Constructor: Takes parameters for each field and initializes them
In C++, this often uses member initializer lists for efficiency
Handles transferring ownership of smart pointers Destructor: Properly cleans up resources
Though with smart pointers, this is often automatic
The rule of five/three considerations Accept Method: The visitor pattern implementation
Pure virtual in the base class
Overridden in each derived class to call the correct visitor method The Base Class: Also generated, including:
The pure virtual accept() method
Virtual destructor for proper cleanup
Any shared functionality The Visitor Interface: A template class with:
Pure virtual visit methods for each expression type
Template parameter for the return type
Virtual destructor

Why Generate Code?

DRY (Don’t Repeat Yourself): The pattern is identical across all 21 types
Consistency: Every class follows exactly the same structure
Easy updates: Adding a new expression type = one line in the generator’s input
Error prevention: No typos in 21 hand-written classes
Maintainability: Change the pattern once, regenerate, done

C++ Specific Considerations

The generator needs to handle C++ specifics that don’t exist in other languages:

Header Guards: Include guards or #pragma once to prevent multiple inclusion Forward Declarations: Sometimes needed to break circular dependencies Smart Pointers: Choosing between unique_ptr (exclusive ownership) and shared_ptr (shared ownership) Move Semantics: Generated constructors should support move operations for efficiency Const Correctness: Making appropriate methods and parameters const Namespace Management: Putting everything in the appropriate namespace

The Build Process

The workflow looks like this:

Write the generator once (it’s a small C++ program or script)
Run the generator whenever you add/modify expression types
The generator outputs Expr.h (and possibly Expr.cpp)
Your main project includes the generated files
Build the interpreter as usual This is meta-programming: using code to write code. It’s common in compiler development where patterns repeat extensively.

Putting It All Together: The AST Printer

Let’s put the Visitor pattern into action by creating something useful: an AST printer that shows the tree structure.

The Goal

Given this tree:

      Binary(*)
     /         \
 Unary(-)    Grouping
    |            |
Literal(123) Literal(45.67)

We want to print:

(* (- 123) (group 45.67))

It’s Lisp-style notation that makes the tree structure crystal clear. Each opening parenthesis shows a node in the tree, and the contents show its children.

The Architecture

The AstPrinter is our first concrete visitor class. It:

Implements the Visitor Interface: Specifically, Visitor<std::string> since we want to produce strings

Has Visit Methods: One for each expression type

visitBinaryExpr() - formats binary operations like (+ 1 2)
visitUnaryExpr() - formats unary operations like (- 5)
visitLiteralExpr() - formats literal values like 123 or "hello"
visitGroupingExpr() - formats grouped expressions like (group ...) Uses a Helper Method: A parenthesize() function that:
Takes a name/operator and a list of sub-expressions
Wraps them in parentheses: (name expr1 expr2 ...)
Recursively calls accept() on each sub-expression

How It Works: The Recursive Dance

When you call the printer on a tree, here’s the flow:

Entry Point: Call print(expression) on the root node
First Accept: The root expression’s accept() is called with the printer
Type Routing: Based on the actual expression type, the right accept() implementation runs
Visit Call: That accept() calls the corresponding visit method: visitBinaryExpr(this)
Recursive Processing: The visit method processes this node and recursively calls accept() on child expressions
Bottom-Up Building: Recursion unwinds, building the string from leaves to root
Final Result: The complete string representation

Example Walkthrough

Let’s trace through printing (* (- 123) (group 45.67)):

Step 1: Start at the root Binary(*) node

Call accept() on the Binary expression
Routes to visitBinaryExpr() Step 2: Visit method processes the Binary
Calls parenthesize("*", left_expr, right_expr)
Opens with (*
Now needs to process left child (the Unary) Step 3: Recursively call accept() on the Unary(-) node
Routes to visitUnaryExpr()
Calls parenthesize("-", right_expr)
Opens with (-
Needs to process its child Step 4: Recursively call accept() on Literal(123)
Routes to visitLiteralExpr()
Simply returns "123" (no children to recurse on) Step 5: Unary complete
Gets back "123" from its child
Completes: (- 123)
Returns this string Step 6: Back to the Binary’s right child
Call accept() on the Grouping node
Routes to visitGroupingExpr()
Calls parenthesize("group", inner_expr) Step 7: Recursively process Grouping’s child
Literal(45.67) returns "45.67"
Grouping completes: (group 45.67) Step 8: Binary complete
Has left result: (- 123)
Has right result: (group 45.67)
Combines: (* (- 123) (group 45.67)) Done. The recursive structure of the visitor naturally mirrors the recursive structure of the tree.

The Beauty of Recursion

Notice the elegant symmetry:

Trees are recursive: Expressions contain sub-expressions contain sub-expressions…

Visitors are recursive: Visit methods call accept on sub-expressions, which call visit methods…

This mutual recursion is the heart of the visitor pattern. Each node handles its own printing by delegating to its children. No node needs to know the complete tree structure—just its immediate children.

It’s like a chain of command: the general doesn’t need to know every soldier’s name. The general commands the colonels, who command the majors, who command the captains, and so on. Each level handles its own responsibility and delegates downward.

Using the Printer

To use the printer in practice:

Build a tree (manually for testing, or from the parser later)
Create a printer instance: AstPrinter printer;
Call print: std::string result = printer.print(expression);
The result: A string showing the complete tree structure

This printer is invaluable for debugging. When your parser generates a tree, print it to verify it has the structure you expect. When expressions evaluate wrong, print the tree to see if the precedence is correct.

Beyond Printing: The Visitor Template

The printer demonstrates the visitor pattern, but remember: this pattern is a template for all operations.

Want to evaluate expressions? Create an Evaluator visitor:

visitBinaryExpr() evaluates both operands and applies the operator
visitUnaryExpr() evaluates the operand and applies the operator
visitLiteralExpr() returns the literal value
Same recursive structure, different logic. Want type checking? Create a TypeChecker visitor:
visitBinaryExpr() checks that operands have compatible types
visitUnaryExpr() checks the operand matches the operator’s requirements
Same pattern, different purpose. Every operation follows the same architectural pattern. Master it once, apply it everywhere.

Demo

A sample demo of the AST printer that we made.

input: -123 * (45.67)

Looking Ahead

We now have the data structures to represent code as trees. But we haven’t actually built those trees yet. That’s the job of the parser, which we’ll build in the next chapter.

The parser will:

Take our flat sequence of tokens
Use the grammar rules to understand structure
Build the AST for us using our expression classes Then, once we have the tree, the interpreter will walk it (using another visitor) and execute the code.

We’re assembling the pieces of the puzzle:

We’ve finished the lexer and defined the AST structure. Next up: the parser, where all this grammar theory gets put into practice.

In the next post, we’ll build the parser that creates these beautiful trees from our tokens. Get ready for some serious grammar gymnastics and the magic of recursive descent parsing.

Tags: /wyrdlang/ /interpreter/ /crafting-interpreters/ /cpp/ /c++/ /programming-language/ /ast/ /meta-programming/ /visitor-pattern/ /data-structures/ /tree/