knowing how to build a lexer allows you to extend your applications and opens up a whole new world. You can write your own DSLs or your own language or just better separate symbols: in other words, it allows you to have more control over a string
what is a lexer?
from user input to code execution there are three main steps either for interpreters or compilers:
- lexical analysis (1)
- parsing (2)
- code generation (example: to machine code or bytecode)
- execution (3)
a lexer is a tool that performs lexical analysis
difference between lexical analysis and scanning
in the beginning, scanning and lexical analysis were two different steps but due to the increased speed of processors, they now refer to one and the same process and are used interchangeably.
what is scanning?
scanning means to pass over / scan a string character by character (char by char).
what is lexical analysis?
it is the process whereby the scanned characters are turned into lexemes
what is a lexeme?
a lexeme is a recognised piece of string
let us take:
ac
we might build a parser that outputs the following lexemes :
- lexeme 1: the
- lexeme 2: quick
- lexeme 3: brown
- lexeme 4: fox
now let us take a simili code:
for(i; i<arr.length; i++){ }
- lexeme 1: for
- lexeme 2: (
- lexeme 3: i
- lexeme 4: ;
- lexeme 5: i
- lexeme 6: <
- lexeme 7: arr
- lexeme 8: .
- lexeme 9: length
- lexeme 10: ;
- lexeme 11: i
- lexeme 12: ++
- lexeme 13: )
- lexeme 14: {
- lexeme 15: }
so our task for this post will be to build a program that can separate those pieces
[sc name=”lexer”]
what does a lexer needs as input?
a lexer needs two things : the source code and keywords
what is a keyword?
a keyword is a lexeme that has a special meaning to the lexer
normally words like print, from, to are known as keywords, however, symbols such as ( , { can also be considered as keywords
types of keywords
there are single character keywords e.g. whitespace
there are multi-char keywords such as print
ignoring keywords
there are cases where we might want to ignore keywords as in the case where they appear between ” ” or in comments
what are ids?
ids are user-defined names, like the names of namespaces, variables, classes and functions. functions in the standard library are only predefined ids. normally keyword names are banned from being used as ids
some observations
let us compare :
the quick brown fox
and
for(i; i<arr.length; i++){ }
for both of the strings, lexing could be carried out. the first case is simple
lexeme = ''
add characters to lexeme variable until next char is a white space
if next is a white space
print lexeme
lexeme = ''
but for the next case we observe a general rule. we break if the next character is a keyword be it a special word or symbol
lexeme = ''
add characters to lexeme variable until next char is a white space (single-char keyword)
if next is a white space single-char keyword
print lexeme
lexeme = ''
in reality we also add the condition of if the element itself is a keyword
special cases
now let us take the case of ++, + is a keyword and ++ too but how do we differenciate between them or ==?
that’s when we use flags to determine which context are we in
flags are just variables that act like switches e.g.
x = 0
if … : x = 1
if x == 1 : …
or use some checks !
writing the scanner
basically we just need a program that outputs a char at one time. in python it is easy :
string = 'i am coming'
for char in string:
print(char)
which outputs:
i
a
m
c
o
m
i
n
g
just some fancy modifications :
string = 'i am coming'
for i, char in enumerate(string): # i gives us index and char the element
print('char', str(i+1).rjust(3, ' '), ':', char)
outputting:
char 1 : i
char 2 :
char 3 : a
char 4 : m
char 5 :
char 6 : c
char 7 : o
char 8 : m
char 9 : i
char 10 : n
char 11 : g
writing the basic lexer
now that we can read a string char by char, we can now check if next char is a keyword
let us do it for the phrase:
the beautiful white moon
where white space is a keyword, referring to above :
lexeme = ''
add characters to lexeme variable until next char is a white space
if next is a white space
print lexeme
lexeme = ''
we implement something similar to it i.e checking if next char is a white space:
string = 'the beautiful white moon'
white_space = ' '
lexeme = ''
for i,char in enumerate(string):
lexeme += char # adding a char each time
if (i+1 < len(string)): # prevents error
if string[i+1] == white_space: # if next char == ' '
print(lexeme)
lexeme = ''
which outputs :
the
beautiful
white
moon is missing as there is no white space after it, we fix this edge case by printing the lexeme after the loop :
string = 'the beautiful white moon'
white_space = ' '
lexeme = ''
for i,char in enumerate(string):
lexeme += char # adding a char each time
if (i+1 < len(string)): # prevents error
if string[i+1] == white_space: # if next char == ' '
print(lexeme)
lexeme = ''
print(lexeme)
output :
the
beautiful
white
moon
fixing whitespace addition
now white space got added to our lexeme, to fix we just add a character if the current char is not a whitespace :
string = 'the beautiful white moon'
white_space = ' '
lexeme = ''
for i,char in enumerate(string):
if char != white_space:
lexeme += char # adding a char each time
if (i+1 < len(string)): # prevents error
if string[i+1] == white_space: # if next char == ' '
print(lexeme)
lexeme = ''
print(lexeme)
output :
the
beautiful
white
moon
running on a real piece of code
let us take this piece of java code taken from tutorialspoint:
public class Test {
// main function
public static void main(String args[]) {
int [] numbers = {10, 20, 30, 40, 50};
for(int x : numbers ) {
System.out.print( x );
System.out.print(",");
}
/* printing new line
*/
System.out.print("\n");
String [] names = {"James", "Larry", "Tom", "Lacy"};
for( String name : names ) {
System.out.print( name );
System.out.print(",");
}
}
}
our first task is to identify single-char and multi-char keywords
MULTI-CHAR KEYWORDS :
public, class, static, void, main, string, int, for
SINGLE-CHAR KEYWORD :
{ } ( ) ; [ ] : . \ ” *
we did not consider System as a keyword as in the language, System is the name of an id (user-defined name)
now we can feed those keywords to our lexer
preassumption on whitespace
since whitespace still separates between like public and class:
public class Test {
we’ll keep the if next == whitespace rule
source codes are just pieces of strings
we could have ignored newline was it not for single line comments
a first attempt:
string = '''
public class Test {
public static void main(String args[]) {
int [] numbers = {10, 20, 30, 40, 50};
// printing !
for(int x : numbers ) {
System.out.print( x );
System.out.print(",");
}
System.out.print("\n");
String [] names = {"James", "Larry", "Tom", "Lacy"};
/*
looping over
*/
for( String name : names ) {
System.out.print( name );
System.out.print(",");
}
}
}
'''
symbols = ['{', '}', '(', ')', '[', ']', '.', '"', '*', '\n', ':', ','] # single-char keywords
other_symbols = ['\\', '/*', '*/'] # multi-char keywords
keywords = ['public', 'class', 'void', 'main', 'String', 'int']
KEYWORDS = symbols + other_symbols + keywords
white_space = ' '
lexeme = ''
for i,char in enumerate(string):
if char != white_space:
lexeme += char # adding a char each time
if (i+1 < len(string)): # prevents error
if string[i+1] == white_space or string[i+1] in KEYWORDS or lexeme in KEYWORDS: # if next char == ' '
if lexeme != '':
print(lexeme.replace('\n', '<newline>'))
lexeme = ''
the added conditions test if the next char is a single-char keyword or if what we already have at hand after adding a char is a keyword, if so, we then check if our variable is not empty else we’ll print an empty lexeme. we also replaced newline by
the above code outputs in:
<newline>
public
class
Test
{
<newline>
<newline>
public
static
void
main
(
String
args
[
]
)
{
<newline>
int
[
]
numbers
=
{
10
,
20
,
30
,
40
,
50
}
;
<newline>
//
printing
!
<newline>
for
(
int
x
:
numbers
)
{
<newline>
System
.
out
.
print
(
x
)
;
<newline>
System
.
out
.
print
(
"
,
"
)
;
<newline>
}
<newline>
System
.
out
.
print
(
"
<newline>
"
)
;
<newline>
String
[
]
names
=
{
"
James
"
,
"
Larry
"
,
"
Tom
"
,
"
Lacy
"
}
;
<newline>
/
*
<newline>
looping
over
<newline>
*
/
<newline>
for
(
String
name
:
names
)
{
<newline>
System
.
out
.
print
(
name
)
;
<newline>
System
.
out
.
print
(
"
,
"
)
;
<newline>
}
<newline>
}
<newline>
}
see how it magically separates 10 and , System and . , String and name … all perfect but : we have a glitch
a glitch and the introduction the next step!
see at /* and */, it considered them as / then * and not as a single entity. that is because we defined * as a keyword, though we did not use it here but we use it in multiplication like : 12 * 12
so, our basic lexer has a confusion at this junction, we can for the time being tackle it like :
we modify only the loop :
for i,char in enumerate(string):
if char == '*':
if string[i-1] == '/':
lexeme += '/*'
elif string[i+1] == '/':
lexeme += '*/'
else:
lexeme += '*'
elif char == '/':
if string[i+1] != '*' and string[i-1] != '*':
lexeme += '/'
else:
continue
else:
if char != white_space:
lexeme += char # adding a char each time
if (i+1 < len(string)): # prevents error
if string[i+1] == white_space or string[i+1] in KEYWORDS or lexeme in KEYWORDS: # if next char == ' '
if lexeme != '':
print(lexeme.replace('\n', '<newline>'))
lexeme = ''
we just added some checks to make things clear with the result that now the comments are recognised as such :
/*
<newline>
looping
over
<newline>
*/
conclusion
we built a lexer by voluntarily leaving out regex given that some lookaheads is a breeze in py. we’ll move on next time to the tokeniser.
in this post, brevity was voluntarily left out, favouring a more complete approach, more programming than jargon usage!