|
|
|||
![]() |
Department of Engineering |
| University of Cambridge > Engineering Department > computing help |
Contents People sometimes ask if there's a C++ library function that given a string like "2+(3*4)" will return its value as a number. There's isn't such a function, but writing one is possible. This document illustrates how it can be done in about 50 lines of code using flex and bison. These programs generate source code to parse text that's structured (a computer program, for example, or a configuration file, or a maths expression). They're not trivial to use, but they're less error-prone than hand-crafted alternatives. The minimum of details will be explained here - with luck you should be able to adapt the programs in small ways for your own purposes.
Before introducing the example, I'll briefly deal with parsing in general, and with "regular expressions".
If you've ever written a program that reads text lines that are in particular formats, then you've written a sort of parser. Compilers include more sophisticated parsers. In general the first stage in parsing is "lexical analysis" (or "scanning") - reading in a string of characters and outputing a stream of strings and tokens. In this phase white space is usually deal with (as are comments), and keywords are replaced by symbols. lex or flex produce code for a function that does this, called yylex().
The second phase builds these elements (often known as lexemes) into expressions or phrases and "makes sense" of them. Yacc ("yet another compiler-compiler") and bison (so named because it isn't a Yak) produce code for a function called yyparse() that does this.
Both phases need to detect errors. That's not so hard, but it's difficult to produce error reports that are useful to humans, which is where hand-crafted parsers sometimes have an edge over auto-generated ones.
Sophisticated parsers come in several types. You don't need to know about the technical details, but a little of the vocabulary is useful to know about, in case things go wrong. When a grammar is described as LR(1) the first L means that the input is read from Left to right - i.e, from the start of the file (or line) to the end. The R means that the parser produces right-most derivations. The alternative to this is L (left-most derivations)
The number in brackets after the LR indicates how far the parser needs to look ahead before being able to understand the token it's on. bison can only understand grammars where looking one token ahead removes ambiguities.
bison essentially produces LR(1) parsers. Not all grammars can be dealt with by such parsers ("C++ grammar is ambiguous, context-dependent and potentially requires infinite lookahead to resolve some ambiguities" - Meta-Compilation for C++)
Regular Expressions are used with flex. You can think of Regular Expressions as a more sophisticated form of wildcard characters. On the Unix command line, ? matches a single character, and * matches 0 or more characters, so if you wanted to list all the files whose names have an x as the 4th letter, you could type ls ???x*. Regular Expressions use a different notation and can express more complex patterns. Here's a list of the features used on this page
[0-9]+("."[0-9]+)?
This matches one or digits followed (optionally, hence the
final ?) by a "." and one or digits - i.e. it matches positive real numbers.
The aim of this example is to write a calculator that will
understand +, -, *, /, and ^ (power)
and give them the correct precedence, understanding brackets. Numbers can be
integers, or reals, and "scientific notation" can be used. Similar code to that shown here is on many web pages.
header
tokens
%%
rules
%%
user-defined functions
bison file format
We'll deal with the 2nd stage first.
Put the following into a specification file called calc.y
%{
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define YYSTYPE double
%}
%token NUMBER
%token PLUS MINUS TIMES DIVIDE POWER
%token LEFT RIGHT
%token END
%left PLUS MINUS
%left TIMES DIVIDE
%left NEG
%right POWER
%start Input
%%
Input:
| Input Line
;
Line:
END
| Expression END { printf("Result: %f\n", $1); }
;
Expression:
NUMBER { $$=$1; }
| Expression PLUS Expression { $$=$1+$3; }
| Expression MINUS Expression { $$=$1-$3; }
| Expression TIMES Expression { $$=$1*$3; }
| Expression DIVIDE Expression { $$=$1/$3; }
| MINUS Expression %prec NEG { $$=-$2; }
| Expression POWER Expression { $$=pow($1,$3); }
| LEFT Expression RIGHT { $$=$2; }
;
%%
int yyerror(char *s) {
printf("%s\n", s);
}
int main() {
if (yyparse())
fprintf(stderr, "Successful parsing.\n");
else
fprintf(stderr, "error found.\n");
}
When the specification file is processed using
bison -d calc.yit produces calc.tab.h and calc.tab.c. The '-d' makes bison produce the header file - flex needs the information (a shared list of tokens) in this.
header
regular expressions
%%
rules
%%
user-defined functions
flex file format
flex is what reads the input and processes it ready for bison
to use. flex and bison need to agree on some token
definitions - these are generated for lex in calc.tab.h.
Below is the definition file for flex. Put the code into a file called calc.lex
%{
#define YYSTYPE double
#include "calc.tab.h"
#include <stdlib.h>
%}
white [ \t]+
digit [0-9]
integer {digit}+
exponent [eE][+-]?{integer}
real {integer}("."{integer})?{exponent}?
%%
{white} { }
{real} { yylval=atof(yytext);
return NUMBER;
}
"+" return PLUS;
"-" return MINUS;
"*" return TIMES;
"/" return DIVIDE;
"^" return POWER;
"(" return LEFT;
")" return RIGHT;
"\n" return END;
Do
flex -o calc.lex.c calc.lexto produce calc.lex.c. At the end of this you should have the C source code for a simple calculator. The code isn't easy to read - it's not meant to be. To compile it do
gcc -o calc calc.lex.c calc.tab.c -lfl -lm
Then type
./calcand type in mathematical expressions. E.g. typing 4e2+(2^0.5*3/2) should give you 402.121320. Then try typing in 7+8) and see if you get a useful error message. Press Ctrl and C to quit.
calc: calc.y calc.lex bison -d calc.y flex -o calc.lex.c calc.lex gcc -o calc calc.lex.c calc.tab.c -lfl -lm(the indented lined must use tabs - spaces aren't good enough) then typing make will save you time whenever you want a new version of the program.
extern char* yytext;
int yyerror(char *s) {
printf("%s - %s\n", s, yytext);
}
extern char* yytext;
extern int yylineno;
int yyerror(char *s) {
printf("%s on line %d - %s\n", s, yylineno, yytext);
}
and put
%option yylinenoin calc.lex (in the section where white etc are defined) you should see the line numbers.
./calc < filenameif you have a text file called filename with a few lines of arithmetic expressions in it, but neater would be the ability to do
./calc filenameyylex() reads input from yyin so if you set this using
extern FILE* yyin;
yyin = fopen("filename", "r");
in main, it should work, but you can make it friendlier and more flexible.
flex and bison are the successors of lex and yacc, with which they're mostly back-compatible. They're available on Unix/Linux machines, often being installed by default. On Macs they're bundled in with Xcode. flex++ and bison++ exist for C++ users - though they can use flex and bison as above. There are C++ options in flex and bison but I've had trouble getting them to work.
| | computing help | |