
An Introduction to Scripting Languages- Thomas Harte
![]()
It seems to me that one of the most asked about topics amongst moderate to advanced hobbyist programmers nowadays is scripting languages - specifically how to design and implement them. Although this is an advanced and complicated topic I will endeavour to summarise the basics in an attempt to leave the reader with a clear thinking on how to design a language and knowledge of the most simple manner in which to implement one in an interpreted fashion. For the sake of brevity, I intend to mercilessly skip as much theory as possible.
Although this document claims to be about scripting languages, I’ve actually lied since instead it is about procedural or imperative languages in general. So the joke’s on you, and you should probably see the appendix examples for how this sort of thing can be used for scripting.
Of course most of what I say can be implemented in any language, but this article uses C for all worked examples. If you’re an Ada, Pascal, Java or whatever fanatic then please don’t hate me.
From time to time this document refers to a couple of example programs, the source for which are available in this archive.
Tokens and Grammars
In general there are two basic sets of information required to define any language - the first is a list of tokens. Tokens for a particular language are meaningful sequences of characters. For example, in C while is a token, and if is also a token. You don’t necessarily know exactly what each token looks like - you might just know its general form. For example, I can describe and recognise the general form of an integer rather than stating all the possible integer numbers as different tokens.
The second piece of information that describes a language is the grammar. In this article I will deal exclusively with the context free class of grammars. Many languages have the tokens while, ( and ) and a rule stating that a valid use of these tokens (assuming we already know how to recognise an expression and statement) is :
while( <expression> ) <statement>
This is one grammar rule. Grammar rules are usually defined with arbitrary levels of recursion. Although it may not look like it, that rule may have an arbitrary level of recursion if we factor in the idea that such a rule is one possible statement.
The following sections will attempt to give you a more formal idea of how to define tokens and grammar rules, and hence how to handle them with actual code.
Tokenising [scanning]
The first task in meaningfully executing scripts in your made up language is to decompose the incoming sequences of characters into their tokens. Remember that tokens are a relatively high level notion and that you will typically have just one token to represent all variable identifiers, one to represent the presence of an integer constant, and so on. The module that achieves this is commonly known as a scanner.
Typically your scanner will look at the incoming characters, decide which token they are, and then make a note of that token by means of a unique number representing it. Of course, in C, enum is the perfect piece of syntax for this.
Therefore, for the hand coding programmer, this is really just an exercise in string matching and itself doesn’t warrant much discussion. However, even without going into the highly theoretical realms of finite state automata which the best manner of string matching in general, there are some non-obvious issues that need to be addressed.
The first is ambiguities. Returning again to the C language, & is a valid token representing bitwise and, but && is also a valid token representing logical and. So if your scanner reads one & character, should it immediately return since it already has a valid token? Of course not - most of the time your scanner will want to make sure it gets rid of the largest number of characters it can when generating each token. This extra piece of information is known as a disambiguating rule, and is just about the only disambiguating rule applicable to a scanner.
The second issue is one of general form. When you first attack the task of writing a scanner the best layout may not immediately grab you. Should you in one fell swoop go through the entire character input stream and make a corresponding stream of tokens? Maybe just one function that returns the next token?
In actual fact, there is a particularly useful form for scanners the origin of which is beyond the scope of this article. The first function you want to write gets the next token from the input stream and stores it in a global variable. This stored token is now known as the current token. I will refer to this function as GetToken.
A second function takes a token as an argument and if the argument is the same as the current token then the token is ‘accepted’ and a new token is read from the input. If the tokens are not the same an error occurs, and you can just print a message and exit or something. I’ll call this function Match.
Of course many times you will want to store some extra information about the current token than just which it is - e.g. in the case of an integer constant you might want to also know its value! If so then your GetToken must decode and store this information also.
Building an Example Language : The Scanner
For the purposes of demonstration I now propose the tokens for an example language. These tokens are the 26 lower case letters of the alphabet which are all considered to be identifiers, all positive integer numbers, the mathematical operators =, +, -, /, * and %, the comparison operators <, >, == and !=, the reserved words PUT and GET and also (, ), {, } and ;. All white space and carriage return / line feeds are ignored, provided they don’t break a multi-character token.
Of course there are near infinite ways to write a scanner for these tokens, but I’ve presented one example in appendix A which you might want to cast your gaze over at this stage. Its high dependency on having an open file to parse makes it less than optimal, but that isn’t the point.
Parsing : Grammar Rules
Parsing is the practice of looking at the incoming tokens, looking at the grammar rules and deciding which rules are being used at a particular point or alternatively that no rule is being followed, in which case you probably want to issue an error or something. Although I decided not to explain a standard form for describing tokens I will discuss a standard form for describing grammar rules since this form will have a massive influence not only on how you form your scripting language, but how you implement your parser also.
I intend to describe EBNF - Extended Backus-Naur form. It relies on the idea of terminals and productions or non-terminals (either name is valid). A terminal is just a token from your scanner. A non-terminal is a higher level analogue of a token. For example, you couldn’t form a particular token to represent all possible expressions but you can form a non-terminal for that purpose. A production is a grammar rule showing how one non-terminal can be seen to produce a particular ordered group of non-terminals and terminals. From now on I consider one production to be one rule.
Look at this :
while-statement -> while "(" expression ")" statement
That is EBNF syntax which states that while-statement is a non-terminal which produces everything on the right hand side of the arrow. It is assumed other rules already exist for expression and statement, and for the purpose of this document I’ve made clear that I mean bracket terminals rather than the brackets which are part of the EBNF syntax by encasing them in quotes. Of course while is a terminal.
I’ll sum up more rules of EBNF later, but for now be aware that the final thing you need to fully define your language is a start symbol. For example, consider a numeric calculator with the simple operations add and subtract. I might define :
expression -> number op expression | numberThere is one new piece of syntax in that example - the horizontal bar |, which represents choice. From the topmost rule, an expression may produce either number op expression or number. This is why EBNF itself uses brackets, to disambiguate certain choices, i.e. so if I combined the two previous rules into one it would make sense and be :
expression -> number (+ | -) expression | number
Rather than :
expression -> number + | - expression | number
Which doesn’t make sense. Anyway, returning to the point about start symbols, if I have the productions for expression and op how do I know which I’m supposed to be looking for first? Is the program ‘+’ valid since it is an op? The answer is, since I’ve already told you my language describes numeric operations, no it isn’t. Or better, I would have stated that the start symbol was expression, and could then easily show that + is not a valid expression.
Now, a quick summary of three other particularly useful rules of EBNF - curly brackets signify repetition, so for example :
number -> {digit}
Square brackets show optional parts, so I could rewrite the expression production as :
expression -> number [ op expression ]
Finally there is a special character - the epsilon (which I would like to be able to reproduce here but cannot, so will refer to explicitly by name) which represents the presence of no token. E.g.
else-statement -> else expression | epsilon
Means that the rule :
if-statement -> if "(" expression ")" statement else-statement
Happily matches an if statement with no else part.
You might be tempted to actually use number -> {digit} and similar rules to make your parser also do the work of a scanner and not have to write two bits of code. Although you could do this it isn’t recommended purely on the grounds that for achieving the same task a parser is quite a bit more expensive in terms of memory and processing overhead than a scanner so you should try to defer as much work from the former to the latter as is possible.
Parsing : Writing Code
The quickest and easiest way to write a parser from a given set of grammar rules in EBNF is to write what is known as a ‘recursive descent’ parser. This type of parser is best suited to hand coding (which is why it is rarely used nowadays) and has been used in the past for languages as complicated as C and Pascal. In order to write such a parser, you simply write an individual function to parse every rule.
Consider expression -> number [op expression]. Assuming your GetToken leaves the token id in the global variable ‘token’, here is a function that parses that rule :
void expression(void)
{
match(TOKEN_NUMBER);
if(token == TOKEN_PLUS || token == TOKEN_MINUS)
{
op();
expression();
}
}
Of course you will have some other function for op, and to start parsing you just need to call GetToken once to get the first token, then call the function related to your start symbol.
Parsing : And Actually Doing Something
So now you can write a parser for an arbitrary language! It will zip through your source and if there are no syntax errors eventually stop. Without actually doing anything. This is because your grammar rules state what is valid grammar, but not what the grammar actually does.
But consider this rewrite of expression (assuming your GetToken leaves the value of scanned numbers in tokenvalue) :
int expression(void)
{
int result = tokenvalue;
match(TOKEN_NUMBER);
switch(token)
{
case TOKEN_PLUS :
op();
result += expression();
break;
case TOKEN_MINUS :
op();
result -= expression();
break;
}
return result;
}
This code shows two important things - the first of which we have already seen, i.e. that grammar rules don’t actually have any computational meaning, they are just a way of finding out if a program is valid and when combined with the idea of a recursive descent parser making a convenient code base for adding meaning in an ad hoc manner.
Although this function isn’t very clever - it will produce ‘-2’ from ‘2 - 3 + 1’ - that sort of problem is easily avoided by picking your rules more carefully. I used the recursive nature of rules to signify repetition when I didn’t need to since we have the capability repetition already available in EBNF. A better rule is
expression -> number { (+ | -) number }
Which maps more sensibly to :
int expression(void)
{
int result = tokenvalue;
match(TOKEN_NUMBER);
while(token == TOKEN_PLUS || token == TOKEN_MINUS)
switch(token)
{
case TOKEN_PLUS :
match(TOKEN_PLUS);
result += tokenvalue;
match(TOKEN_NUMBER);
break;
case '-' :
match(TOKEN_MINUS);
result -= tokenvalue;
match(TOKEN_NUMBER);
break;
}
return result;
}
And produces the correct results!
A second observation at this point is that if op remains a separate function, my expression function still needs to know all of the tokens that signify an op. Suppose op were :
op -> ( + | - ) albert
Where albert is a terminal. The test in expression for token equal to + or - would still work. Hence expression wouldn’t need to know about the whole of op, just what is known as the first set - the set of all symbols that can come first and still potentially allow a valid op. You may find it useful to explicitly declare first sets.
But there are some pitfalls to recursive descent parsing with meaning when combined with the scanner form already presented - consider while statements again :
while-statement -> while "(" expression ")" statement
Clearly expression is going to be parsed many times, so you’ll need some way of returning to that point in the input stream before you match the token immediately before the expression (the left bracket in this case). Further, at some point you won’t want to execute statement so you need to be able to skip that section.
There is no standard form for doing either of these things, but again I must emphasise that adding meaning to your grammar rules is a highly language dependant thing and usually leads to specialised rather than general purpose code. The example scripting language presented later with example code in appendix D has both if and while statements if you’re really stuck.
Parsing : Continuing the Example Language
Remember the example scanner I presented? Go on, think hard. That’s the one! Now the grammar rules for the example parser shown in appendix B. The start token is ‘statement-list’.
statement-list -> { statement ; }Some example programs are shown in appendix C, although notice that the mathematical operators are given the ‘wrong’ ordering by the parser.
Combating Potential Grammar Problems
Remember expression -> number [op expression] ? Suppose you had written expression -> expression [op number] ? Well, that doesn’t work with a recursive descent parser since the first thing the expression routine will do is call the expression routine. But don’t fear - there is a work around : left recursion removal. Although the fix for this rule is obvious, I’ll discuss the general case anyway.
Suppose you have some arbitrary rule :
A -> A b | B
Simply correct to two rules :
A -> B A’And there is really only other one problem to avoid with recursive descent parsers. Consider this strange but valid version of the expression rule :
expression -> number + expression | number - expression | number
When your parser spots a number how does it know whether to match a - or a +? Of course it doesn’t, which is why you should left factor your rules, exactly as the repressed mathematician in you thinks you should. E.g.
expression -> number ( + expression | - expression | epsilon )
An Example Scripting Language
Returning to the original title of this document, I’ve devised a special language suitable for scripting of some computer controlled entities (I’ll call them characters) in a game type scenario. This language, and three scripts are within the example zip file as example 2.
The game attaches one instance of a script to every character. It also attaches a state - namely the characters location and 24 other temporary values.
For simplicity, beyond the state mentioned above, each character’s script is only additionally aware of the player’s location and no effort will be extended on frame compensation at the script level. Each character’s script will be executed once per frame.
For this purpose I have selected these tokens with explicit related strings :
; if ( ) { } else while == = + - < > != abs playxpos playypos rand
In addition to the single character ids and non-negative integer constants used before. It is worth noting that for my scripting languages, the ids ‘x’ and ‘y’ hold the character’s current location, and rand holds either 0 or 1 randomly each time it is read. My grammar rules are more useful this time, including selection and repetition :
if-statement -> if "(" right-expression ")" statement-list [ else statement-list ]statement-list is the start symbol. This language has some oddities compared to C - mostly related to a greater need for semicolons and a less flexible expression/statement distinction, but nevertheless allows quite some flexibility in movement patterns.
The full source for an example implementation of this language, a small game style engine in which it runs and three example scripts is included in the example zip file.
Some notes about that source :
Correct Precedence for Mathematical Operators
As I said, making your operators act correctly is just a matter of jiggling your grammar. For example :
exp -> exp addop term | termProduces an easy way to correctly parse mathematical statements with multiplies, adds and divides. Further consideration of this topic is left to the reader.
Code Generation
This article is geared towards writing your first language interpreter. Therefore, it necessarily isn’t interested in code generation. However, I thought I would provide a quick pointer. The basic idea is to build a tree of operations when you parse rather than actually carry out the operations. At the end a simple ordered traversal of the tree will enable code generation to machine code or byte-code or whatever.
Continued Study
If you’ve read and understood this article, possibly even implemented a language interpreter, then you’ve grasped the basics of the subject. Did you spot the key word in that sentence? I’ve left out so much I’m surprised anything is in this article at all. Some topics you should read further on :
scanning : regular expressions, LEX, finite state automata
parsing : context free grammats, EBNF, top-down and LL parsing, bottom-up and LR parsing, YACC
compiling : parse trees, syntax trees, attribute grammars, symbol tables, runtime environments
Goodbye
I can be found at T.Harte@excite.co.uk and thanks to Chris Barry for proof reading!
APPENDIX A : First Example scanner
#includeFILE *input = stdin; /* first some token definitions */ enum Tokens { TK_ID, TK_NUM, TK_ASSIGN, TK_PLUS, TK_MINUS, TK_DIV, TK_MULT, TK_MOD, TK_LT, TK_GT, TK_EQ, TK_NOTEQ, TK_GET, TK_PUT, TK_LBRAC, TK_RBRAC, TK_LCURLY, TK_RCURLY, TK_SEMI }; enum Tokens Token; int TokenValue; /* an error function */ void Error(char *str) { puts(str); exit(1); } /* now a get token function */ void GetToken(void) { int c; char *stcompare; /* used to decode the end of PUT or GET */ char get_str[] = "ET"; char put_str[] = "UT"; do c = getc(input); while(c == ' ' || c == '\t' || c == '\n' || c == '\r'); if(c == EOF) { exit(0); } if(c >= 'a' && c <= 'z') { Token = TK_ID; TokenValue = c - 'a'; return; } if(c >= '0' && c <= '9') { TokenValue = 0; Token = TK_NUM; while(c >= '0' && c <= '9') { TokenValue = (TokenValue * 10) + c - '0'; c = getc(input); } ungetc(c, input); return; } switch(c) { default : Error("Illegal character"); case '+' : Token = TK_PLUS; return; case '-' : Token = TK_MINUS; return; case '/' : Token = TK_DIV; return; case '*' : Token = TK_MULT; return; case '%' : Token = TK_MOD; return; case '<' : Token = TK_LT; return; case '>' : Token = TK_GT; return; case '(' : Token = TK_LBRAC; return; case ')' : Token = TK_RBRAC; return; case '{' : Token = TK_LCURLY; return; case '}' : Token = TK_RCURLY; return; case ';' : Token = TK_SEMI; return; case '!' : if(getc(input) == '=') { Token = TK_NOTEQ; return; } Error("Illegal character"); case '=' : /* Notice : disambiguating rule at work here, == is taken if possible, otherwise just = is assumed */ if((c = getc(input)) == '=') { Token = TK_EQ; return; } ungetc(c, input); Token = TK_ASSIGN; return; /* the next four are if/while/read/write all of which set a pointer to the intended string and optimistically set the token value, then leave the rest of gettoken to see if they were right */ case 'P' : Token = TK_PUT; stcompare = put_str; break; case 'G' : Token = TK_GET; stcompare = get_str; break; } while(*stcompare) { if(*stcompare != getc(input)) Error("bad token"); stcompare++; } } /* and the fairly generic match function */ void Match(enum Tokens tok) { if(tok == Token) GetToken(); else Error("Bad syntax"); }
Some notes :
APPENDIX B : Example Parser
/* some storage space */
int mem[26];
/* function definitions */
int PutStatement(void);
int GetStatement(void);
int MathStatement(int value);
int Statement(void);
void StatementList(void);
int PutStatement(void)
{
int c;
Match(TK_PUT);
printf("%d\n", c = Statement());
return c;
}
int GetStatement(void)
{
char tbuf[32];
Match(TK_GET);
printf("?");
scanf("%31s", tbuf);
return atoi(tbuf);
}
int MathStatement(int value)
{
switch(Token)
{
default : Error("Malformed mathematical statement");
case TK_PLUS : Match(TK_PLUS); return value + Statement();
case TK_MINUS : Match(TK_MINUS); return value - Statement();
case TK_DIV : Match(TK_DIV); return value / Statement();
case TK_MULT : Match(TK_MULT); return value * Statement();
case TK_MOD : Match(TK_MOD); return value % Statement();
}
}
int Statement(void)
{
int t;
switch(Token)
{
case TK_ID :
t = TokenValue;
Match(TK_ID);
switch(Token)
{
case TK_ASSIGN :
Match(TK_ASSIGN);
mem[t] = Statement();
if(!(Token == TK_PLUS || Token == TK_MINUS || Token == TK_DIV || Token == TK_MULT || Token == TK_MOD)) break;
case TK_PLUS :
case TK_MINUS :
case TK_DIV :
case TK_MULT :
case TK_MOD :
mem[t] = MathStatement(mem[t]);
default :
}
return mem[t];
case TK_NUM :
t = TokenValue;
Match(TK_NUM);
if(Token == TK_PLUS || Token == TK_MINUS || Token == TK_DIV || Token == TK_MULT || Token == TK_MOD)
{
t = MathStatement(t);
}
return t;
case TK_PUT : return PutStatement();
case TK_GET : return GetStatement();
}
}
void StatementList(void)
{
while(!feof(input))
{
Statement();
Match(TK_SEMI);
}
}
/*
MAIN PROC
*/
int main(int argc, char **argv)
{
if(argc > 1)
{
input = fopen(argv[1], "rt");
}
/* get the ball rolling */
GetToken();
/* start token is statement-list */
StatementList();
if(input)
fclose(input);
return 0;
}
APPENDIX C : An Example Program using the language in Appendices A and B
a = GET;This program will prompt the user for two numbers, then print the result of the first divided by the second, followed by the remainder.