
Published on: 2026-05-19

By Vishal Rashmika

Day 2: The Lexical Analyzer

Welcome back to our journey of building WyrdLang from scratch! Last time, we laid the groundwork: what a programming language really is, why we’re building a tree-walk interpreter, and how we’ll follow Crafting Interpreters with a twist (C++ instead of Java, and WyrdLang’s own replacement keywords).

Today, we write the first real piece of our interpreter: the lexical analyzer, also called a lexer or scanner. This is where raw source code, a plain string of characters, gets transformed into something the rest of our interpreter can understand.

If the first day was about planning the expedition, today we’re actually starting the climb.

What Exactly Does a Lexer Do?

When we write code, the computer initially sees nothing but a long, continuous string of characters. It doesn’t inherently know where one idea ends and the next begins. The Lexer’s job is to scan this string and chop it into Tokens.

The “Token” System: The Lexer acts like a high-speed sorter, labeling every scrap of text it encounters based on its function:

Keywords: It sees if or return and marks them as “Structural Commands.”
Identifiers: It sees user_score or total and marks them as “Names/Labels.”
Operators: It sees + or * and marks them as “Mathematical Actions.”
Literals: It sees 42 or “Hello” and marks them as “Raw Data.”
Punctuation: It sees { or ; and marks them as “Boundaries.”

Imagine the Lexer encounters this line of code:

x = 10 + y;

It doesn’t just “read” it; it dissects it into a sequence like this:

Character(s)	Lexer’s Classification (Token Type)
x	Identifier
=	Assignment Operator
10	Integer Literal
+	Addition Operator
y	Identifier
;	Statement Terminator

Why This Matters:

If you accidentally type 10abc, the Lexer can be the first line of defense. In many languages it looks at its rulebook, realizes that a number followed immediately by letters doesn’t fit any known “Token” pattern, and throws a Lexical Error.

By the time the Lexer is done, the messy text has been converted into a clean, organized stream of tokens, ready for the next stage, the Parser, to figure out what those tokens actually mean when put together.

Input

A string of source code, like this:

enchant health = 100;
cast health;

Output

A list of tokens – small, labeled chunks that each carry meaning:

# High level overview of the token format
[ENCHANT] [IDENTIFIER(health)] [EQUAL] [NUMBER(100)] [SEMICOLON]
[CAST] [IDENTIFIER(health)] [SEMICOLON] [EOF]

Each token remembers:

Type (what kind of thing it is: keyword, number, identifier, operator…)
Lexeme (the exact characters from the source)
Literal value (for numbers or strings, the actual value like 100 or "Hello")
Line number (so we can later point to exactly where an error happened)
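The four fields above map naturally onto a small C++ struct. Token.h isn't shown in this post, so treat the following as a rough sketch of what such a type could look like rather than the actual WyrdLang definition. The member names, the hasLiteral() helper, and the trimmed-down TokenType enum are all assumptions made for illustration.

```cpp
#include <any>
#include <string>
#include <utility>

// Trimmed-down stand-in for the real TokenType enum (assumption: only the
// values needed for this example are listed, so the numbering differs).
enum TokenType { SEMICOLON, EQUAL, IDENTIFIER, NUMBER, ENCHANT, END_OF_FILE };

// Hypothetical Token layout: one member per field described above.
struct Token {
    TokenType type;      // what kind of token this is
    std::string lexeme;  // the exact characters from the source
    std::any literal;    // parsed value for numbers/strings, empty otherwise
    int line;            // line the token was found on

    Token(TokenType _type, std::string _lexeme, std::any _literal, int _line)
        : type(_type), lexeme(std::move(_lexeme)),
          literal(std::move(_literal)), line(_line) {}

    // True when the token carries a literal value (numbers and strings).
    bool hasLiteral() const { return literal.has_value(); }
};
```

An empty std::any plays the role of the "null" shown in the literal column, which keeps keywords and punctuation from needing a dummy value.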

The lexer doesn’t care about meaning or grammar; it just recognizes the pieces. It’s like sorting LEGO bricks by color and shape before you start building something.

Output Format in our lexer used in WyrdLang

Type | Lexeme | Literal Value

Input:

enchant health = 100;
cast health;

Output

// For: enchant health = 100;
36 enchant  null     // Type 36: ENCHANT (keyword)
19 health   null     // Type 19: IDENTIFIER
13 =        null     // Type 13: EQUAL (operator)
21 100      100      // Type 21: NUMBER (literal value is 100)
8  ;        null     // Type 8: SEMICOLON (terminator)

// For: cast health;
31 cast     null     // Type 31: CAST (keyword)
19 health   null     // Type 19: IDENTIFIER
8  ;        null     // Type 8: SEMICOLON (terminator)
38          null     // Type 38: END_OF_FILE

The number in the Type column is the enum value of that token’s type, as defined in the TokenType.h header.

TokenType.h Header

#pragma once

enum TokenType {
    // Single-character tokens.
    LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE,
    COMMA, DOT, MINUS, PLUS, SEMICOLON, SLASH, STAR,

    // One or two character tokens.
    BANG, BANG_EQUAL,
    EQUAL, EQUAL_EQUAL,
    GREATER, GREATER_EQUAL,
    LESS, LESS_EQUAL,
    
    // Literals.
    IDENTIFIER, STRING, NUMBER,
    
    // Keywords.
    TOGETHER, CLAN, OTHERWISE, NAY, SPELL, CYCLE, WHEN, EMPTINESS, EITHER,
    CAST, MANIFEST, ELDER, THINE, AYE, ENCHANT, ASLONGAS,
    END_OF_FILE
};

Why Can’t We Just Use Raw Text?

Why do we have to go through all this trouble? Why not feed the source code directly to the parser?

Two reasons:

1. Efficiency

The parser doesn’t want to worry about skipping whitespace, handling comments, or checking if = is part of ==. The lexer does all that boring work once, so the parser can focus on structure.

2. Simplicity

Tokenizing turns a messy string into a clean, uniform stream. It’s much easier to write a parser that expects EQUAL tokens than one that has to re-examine raw characters such as = and ! again and again.

Think of the lexer as your interpreter’s first filter. It removes the noise (whitespace, comments) and labels the signal (keywords, numbers, names).

The WyrdLang Tokens: Our Alphabet

Before we can scan, we need to know what we’re looking for. I defined a set of token types that covers everything WyrdLang can say.

Single-character tokens

( ) { } , . ; + - * /

These are easy: we see a (, we emit a LEFT_PAREN. We see a ;, we emit a SEMICOLON.

One-or-two character operators

! != = == > >= < <=

These need a little lookahead. When we see a =, we check the next character. If it’s another =, we emit EQUAL_EQUAL (the equality operator). If not, it’s just EQUAL (assignment).
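Here is one way that one-character lookahead could be sketched. The Cursor struct and the scanOperator() function are hypothetical helpers invented for this example (the real Scanner keeps this state as private members), and the enum is cut down to just the operator types.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Trimmed-down stand-in for the operator part of TokenType.
enum TokenType { BANG, BANG_EQUAL, EQUAL, EQUAL_EQUAL,
                 GREATER, GREATER_EQUAL, LESS, LESS_EQUAL };

// Minimal cursor over the source text (hypothetical helper for this sketch).
struct Cursor {
    std::string source;
    std::size_t current = 0;

    bool isAtEnd() const { return current >= source.size(); }

    // Consumes and returns the current character.
    char advance() { return source[current++]; }

    // Consumes the next character only if it equals _expected.
    bool match(char _expected) {
        if (isAtEnd() || source[current] != _expected) return false;
        ++current;
        return true;
    }
};

// Reads one operator at the cursor; returns std::nullopt for anything else.
std::optional<TokenType> scanOperator(Cursor& _cursor) {
    if (_cursor.isAtEnd()) return std::nullopt;
    switch (_cursor.advance()) {
        case '!': return _cursor.match('=') ? BANG_EQUAL : BANG;
        case '=': return _cursor.match('=') ? EQUAL_EQUAL : EQUAL;
        case '>': return _cursor.match('=') ? GREATER_EQUAL : GREATER;
        case '<': return _cursor.match('=') ? LESS_EQUAL : LESS;
        default:  return std::nullopt;
    }
}
```

The key point is that match() only consumes the second character when it really is part of the operator, so in "= 5" the space is left alone for the next scan.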

Literals

IDENTIFIER – names like health, summonDragon, _temp
STRING – text inside double quotes, like "Hello, world!"
NUMBER – integers and decimals: 42, 3.14

Keywords

Remember our keyword mapping from Day 1? Here’s the complete list as token types:

WyrdLang Keyword	Token Type
enchant	ENCHANT
spell	SPELL
clan	CLAN
when	WHEN
otherwise	OTHERWISE
aslongas	ASLONGAS
cycle	CYCLE
manifest	MANIFEST
cast	CAST
aye	AYE
nay	NAY
emptiness	EMPTINESS
together	TOGETHER
either	EITHER
thine	THINE
elder	ELDER

Plus a special end-of-file token, END_OF_FILE (written as EOF in the token lists), to mark the end of the file.

How the Lexer Works

Let me walk you through the scanning of a tiny WyrdLang program:

enchant x = 5;

I’ll simulate the lexer’s internal state. It keeps:

start – where the current token begins
current – where we are now
line – current line number (starts at 1)

Step 1: Read `'e'`

We see a letter. That means it could be a keyword or an identifier. We keep consuming characters while they are letters, digits, or underscores. We get "enchant".

Then we check our keyword table, yes, it’s a keyword! Emit ENCHANT.

Step 2: Skip whitespace

Space character, ignore, move on.

Step 3: Read `'x'`

Letter again. Consume until we hit a non-identifier character (the space after x).

"x" is not a keyword, so emit IDENTIFIER with lexeme "x".

Step 4: Skip whitespace

Space, ignore.

Step 5: Read `'='`

Single character =. Peek at next char, it’s a space, not another =. Emit EQUAL.

Step 6: Skip whitespace

Space, ignore.

Step 7: Read `'5'`

Digit. Consume all digits. No decimal point. Convert "5" to the number 5.0.

Emit NUMBER with literal value 5.0.

Step 8: Read `';'`

Single character ;. Emit SEMICOLON.

Step 9: End of file

No more characters. Emit EOF.

Final token list

[ENCHANT] [IDENTIFIER(x)] [EQUAL] [NUMBER(5.0)] [SEMICOLON] [EOF]

That wasn’t so hard, right? Now imagine doing this for thousands of lines of code. That’s what our lexer will do in milliseconds.

Handling the Tricky Parts

Not everything is as simple as enchant x = 5;. Real code has comments, strings, and errors. Let’s see how the lexer deals with them.

Comments

WyrdLang supports both single-line and multi-line comments.

// This is a single-line comment

Everything after // until the newline is ignored.

/* This is a multi-line comment */

Here we consume everything until */, even across newlines.

The lexer skips both kinds entirely; they never become tokens.
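A rough sketch of how both comment styles could be skipped is shown below. skipComment() is a hypothetical free function written for this example, not the real Scanner code. It also assumes that block comments do not nest and that an unterminated block comment silently swallows the rest of the file, neither of which is stated above.

```cpp
#include <cstddef>
#include <string>

// Tries to skip a comment starting at _source[_current].
// Returns true if a comment was skipped. Returns false if there is no
// comment here (for example a plain '/'), leaving _current untouched.
bool skipComment(const std::string& _source, std::size_t& _current, int& _line) {
    if (_current + 1 >= _source.size() || _source[_current] != '/') return false;

    if (_source[_current + 1] == '/') {
        // Single-line comment: stop at the newline without consuming it,
        // so the main loop can count the line as usual.
        while (_current < _source.size() && _source[_current] != '\n') ++_current;
        return true;
    }

    if (_source[_current + 1] == '*') {
        _current += 2;  // skip the opening "/*"
        while (_current < _source.size()) {
            if (_source[_current] == '*' && _current + 1 < _source.size() &&
                _source[_current + 1] == '/') {
                _current += 2;  // skip the closing "*/"
                return true;
            }
            if (_source[_current] == '\n') ++_line;  // comments can span lines
            ++_current;
        }
        return true;  // assumption: an unterminated comment runs to end of file
    }

    return false;
}
```

Counting newlines inside block comments matters, because otherwise every error reported after a multi-line comment would point at the wrong line.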

Strings

When we see a ", we enter string mode. We consume characters until the closing ". We also handle escape sequences like \n (newline) and \" (literal quote inside the string).

If we reach EOF without finding the closing quote, we report an error, because an unterminated string is a bug in the user’s code.
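To make "string mode" concrete, here is a minimal sketch. scanString() is a hypothetical standalone function, not the actual string() member. Beyond the \n and \" escapes mentioned above, it also assumes \\ is supported, that unknown escapes keep the escaped character, and that strings may span lines.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Scans a string literal whose opening quote is at _source[_current].
// On success, returns the value with escapes resolved and leaves _current
// just past the closing quote. Returns std::nullopt if the closing quote
// is never found (unterminated string).
std::optional<std::string> scanString(const std::string& _source,
                                      std::size_t& _current, int& _line) {
    std::string value;
    ++_current;  // skip the opening quote
    while (_current < _source.size() && _source[_current] != '"') {
        char c = _source[_current++];
        if (c == '\n') ++_line;  // assumption: strings may span lines
        if (c == '\\' && _current < _source.size()) {
            char escaped = _source[_current++];
            switch (escaped) {
                case 'n':  value += '\n'; break;
                case '"':  value += '"';  break;
                case '\\': value += '\\'; break;
                default:   value += escaped; break;  // unknown escape: keep char
            }
        } else {
            value += c;
        }
    }
    if (_current >= _source.size()) return std::nullopt;  // hit EOF first
    ++_current;  // skip the closing quote
    return value;
}
```

Returning std::nullopt leaves the decision about how to report the error to the caller, which fits a scanner that wants to log the problem and keep going.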

Numbers

We consume digits, then optionally a decimal point and more digits.

That’s it, no scientific notation, no hex, just simple numbers. We can always add more later.
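The number rule is short enough to sketch in full. scanNumber() below is a hypothetical standalone function (the real Scanner has a number() member whose body isn't shown here), and it assumes a trailing dot with no digit after it, as in "5.", is not part of the number.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Scans a number starting at _source[_current], which must be a digit.
// Advances _current past the number and returns its value as a double.
double scanNumber(const std::string& _source, std::size_t& _current) {
    auto isDigitAt = [&](std::size_t _i) {
        return _i < _source.size() &&
               std::isdigit(static_cast<unsigned char>(_source[_i])) != 0;
    };

    const std::size_t start = _current;
    while (isDigitAt(_current)) ++_current;  // integer part

    // Only treat '.' as a decimal point when a digit follows it,
    // so "5." leaves the dot for the next token.
    if (_current < _source.size() && _source[_current] == '.' &&
        isDigitAt(_current + 1)) {
        ++_current;                              // consume the '.'
        while (isDigitAt(_current)) ++_current;  // fractional part
    }

    return std::stod(_source.substr(start, _current - start));
}
```

Storing every number as a double matches the walkthrough earlier, where "5" became the literal value 5.0.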

Unknown characters

What if the source contains a @ or #? Those aren’t part of WyrdLang.

Our lexer will print an error like:

[line 3] Error: Unexpected character.

Then it skips the character and keeps scanning. This way, we can find multiple errors in one run, rather than stopping at the first mistake.
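The "report, skip, keep going" behavior can be illustrated with a deliberately simplified sketch. Both functions here are hypothetical: isKnown() is a crude stand-in for the real scanner's switch over characters, and collectUnexpected() gathers the messages into a vector instead of printing them.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Crude stand-in for "characters the scanner knows how to start a token with".
bool isKnown(char _c) {
    const std::string punctuation = "(){},.;+-*/!=<>\"_";
    return std::isalnum(static_cast<unsigned char>(_c)) != 0 ||
           std::isspace(static_cast<unsigned char>(_c)) != 0 ||
           punctuation.find(_c) != std::string::npos;
}

// Walks the whole source and records one error per unknown character,
// rather than stopping at the first one.
std::vector<std::string> collectUnexpected(const std::string& _source) {
    std::vector<std::string> errors;
    int line = 1;
    for (char c : _source) {
        if (c == '\n') {
            ++line;
            continue;
        }
        if (!isKnown(c)) {
            errors.push_back("[line " + std::to_string(line) +
                             "] Error: Unexpected character.");
        }
    }
    return errors;
}
```

Because the loop never breaks out early, a file with several stray characters produces several messages in a single run.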

What the Lexer Does Not Do

It’s important to know where the lexer’s job ends.

The lexer does not:

Check that enchant is followed by an identifier (that’s parsing)
Make sure parentheses match (parsing)
Understand that 5 + "hello" makes no sense (that’s type checking, later)

The lexer is blissfully ignorant.

It would happily tokenize:

= = = enchant 42 enchant

into:

[EQUAL] [EQUAL] [EQUAL] [ENCHANT] [NUMBER] [ENCHANT]

Even though that’s total nonsense as a program.

That’s fine, the parser will catch the garbage later.

Writing the Lexer in C++

I’m following the jlox structure from Crafting Interpreters, but I’m writing it in C++.

Here’s the skeleton I built:

#pragma once

#include "TokenType.h"
#include "Token.h"
#include <any>
#include <string>
#include <unordered_map>
#include <vector>

class Scanner{
private:
    std::unordered_map<std::string, TokenType> keywords = {
        {"together",   TOGETHER},
        {"clan", CLAN},
        {"otherwise",  OTHERWISE},
        {"nay", NAY},
        {"cycle",   CYCLE},
        {"spell",   SPELL},
        {"when",    WHEN},
        {"emptiness",   EMPTINESS},
        {"either",    EITHER},
        {"cast", CAST},
        {"manifest",MANIFEST},
        {"elder", ELDER},
        {"thine",  THINE},
        {"aye",  AYE},
        {"enchant",   ENCHANT},
        {"aslongas", ASLONGAS}
    };
    
    std::string source;
    std::vector<Token> tokens;
    int start {0};
    int current {0};
    int line {1};

    void scanToken();
    void addToken(TokenType _type);
    void addToken(TokenType _type, std::any _literal);
    bool isAtEnd();
    bool isAlpha(char _character);
    bool isDigit(char _character);
    bool isAlphaNumeric(char _character);
    char peek();
    char peekNext();
    char advance();
    bool match(char _expected);
    void identifier();
    void number();
    void string();

public:
    Scanner(std::string _sourceData);
    std::vector<Token> scanTokens();
    
};

The scanTokens() method loops, calling scanToken() repeatedly until we run out of characters.

Each call to scanToken() looks at the current character and decides what to produce: either a single-character token or a more complex one such as a string or number.
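The loop described above might look roughly like this. MiniScanner is a cut-down, hypothetical cousin of the Scanner class that only knows four single-character tokens, and its Token struct and TokenType enum are simplified stand-ins, so read it as an outline of the control flow rather than the real implementation.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Trimmed-down stand-ins (assumption: the real TokenType and Token are richer).
enum TokenType { LEFT_PAREN, RIGHT_PAREN, PLUS, SEMICOLON, END_OF_FILE };
struct Token { TokenType type; std::string lexeme; int line; };

class MiniScanner {
public:
    explicit MiniScanner(std::string _source) : source(std::move(_source)) {}

    // Scans one token at a time until the source is exhausted,
    // then appends the end-of-file marker.
    std::vector<Token> scanTokens() {
        while (!isAtEnd()) {
            start = current;  // every iteration begins a new lexeme
            scanToken();
        }
        tokens.push_back({END_OF_FILE, "", line});
        return tokens;
    }

private:
    std::string source;
    std::vector<Token> tokens;
    std::size_t start {0};
    std::size_t current {0};
    int line {1};

    bool isAtEnd() const { return current >= source.size(); }
    char advance() { return source[current++]; }

    void addToken(TokenType _type) {
        tokens.push_back({_type, source.substr(start, current - start), line});
    }

    void scanToken() {
        switch (advance()) {
            case '(': addToken(LEFT_PAREN); break;
            case ')': addToken(RIGHT_PAREN); break;
            case '+': addToken(PLUS); break;
            case ';': addToken(SEMICOLON); break;
            case '\n': ++line; break;               // track line numbers
            case ' ': case '\r': case '\t': break;  // ignore whitespace
            default: break;  // a full scanner would dispatch to string(),
                             // number(), identifier(), or report an error
        }
    }
};
```

Resetting start at the top of each iteration is what lets addToken() slice out the exact lexeme with a single substr() call.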

The keyword lookup is done with a hash map (unordered_map<string, TokenType>). When we read an identifier, we check the map. If found, it’s a keyword; otherwise, it’s a user-defined identifier.
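As a small illustration of that lookup, here is a sketch with a two-keyword table. classifyWord() is a hypothetical helper name and the enum is trimmed for brevity; the real table lives inside Scanner as shown in the header.

```cpp
#include <string>
#include <unordered_map>

// Trimmed-down stand-in for TokenType with just two keywords.
enum TokenType { IDENTIFIER, CAST, ENCHANT };

// Returns the keyword's token type, or IDENTIFIER if the word is not a keyword.
TokenType classifyWord(const std::string& _text) {
    static const std::unordered_map<std::string, TokenType> keywords = {
        {"cast", CAST},
        {"enchant", ENCHANT}
    };
    // find() is used instead of operator[] so that a failed lookup
    // does not insert a new entry into the table.
    auto it = keywords.find(_text);
    return it != keywords.end() ? it->second : IDENTIFIER;
}
```

Note that the lookup is case-sensitive: enchant is a keyword, while ENCHANT would come out as an ordinary identifier.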

Testing the Lexer – Does It Actually Work?

I wrote a few tests to make sure my lexer behaves.

Here’s the simplest one.

Input

cast "hello world";

Another test: operators and numbers

Input

10 + 20 * 3.5

Errors – Because They Will Happen

I deliberately fed the lexer some bad code to see how it reacts.

Unterminated string

Input

enchant broken = "Hello, world;

The lexer reaches the end of the file without finding a closing quote, reports the unterminated string, and carries on. It doesn’t crash.

The token list will be incomplete (no STRING token), but the user gets a clear error.

Invalid character

Input

enchant weird = @#$%;

The lexer reports the unexpected characters, then skips them and keeps going.

The rest of the line tokenizes as best it can.

Collecting multiple errors in one pass is much friendlier than stopping at the first problem.

What’s Next?

With the lexer finished, we have a solid foundation. The rest of the interpreter will consume tokens, not raw characters. That means the parser (coming next) can focus entirely on grammar and structure.

In the next post, we’ll build the parser, the part that takes this flat list of tokens and assembles them into an Abstract Syntax Tree (AST). We’ll also start handling syntax errors such as missing semicolons and unmatched parentheses.

For now, the lexer works, and I’m one step closer to having a real programming language.

The Code

I’ve pushed everything to GitHub; you can find it in the WyrdLang repository. Feel free to poke around, try breaking it, and open issues if you find something weird.

Tags: /wyrdlang/ /interpreter/ /crafting-interpreters/ /cpp/ /c++/ /programming-language/
