Your directory should contain the following files:
Makefile README cool.flex test.cl lextest.cc -> [cool root]/src/PA2/lextest.cc mycoolc -> [cool root]/PA2/mycoolc stringtab.cc -> [cool root]/PA2/stringtab.cc utilities.cc -> [cool root]/PA2/utilities.cc handle_flags.cc -> [cool root]/PA2/handle_flags.cc *.d dependency files . other generated files
The include (.h) files for this assignment can be found in [cool root]/lexer
The Makefile contains targets for compiling and running your
program. DO NOT MODIFY.
The README contains this info. Part of the assignment is to fill
the README with the write-up for your project. You should
explain design decisions, explain why your code is correct, and
why your test cases are adequate. It is part of the assignment
to clearly and concisely explain things in text as well as to
comment your code. Just edit this file.
cool.flex is a skeleton file for the specification of the
lexical analyzer. You should complete it with your regular
expressions, patterns and actions.
test.cl is a COOL program that you can test the lexical
analyzer on. It contains some errors, so it won't compile with
coolc. However, test.cl does not exercise all lexical
constructs of COOL and part of your assignment is to rewrite
test.cl with a complete set of tests for your lexical analyzer.
cool-parse.h contains definitions that are used by almost all parts
of the compiler. DO NOT MODIFY.
stringtab.{cc|h} and stringtab_functions.h contains functions
to manipulate the string tables. DO NOT MODIFY.
utilities.{cc|h} contains functions used by the main() part of
the lextest program. You may want to use the strdup() function
defined in here. Remember that you should not print anything
from inside cool.flex! DO NOT MODIFY.
lextest.cc contains the main function which will call your
lexer and print out the tokens that it returns. DO NOT MODIFY.
mycoolc is a shell script that glues together the phases of the
compiler using Unix pipes instead of statically linking code.
While inefficient, this architecture makes it easy to mix and match
the components you write with those of the course compiler.
DO NOT MODIFY.
cool-lexer.cc is the scanner generated by flex from cool.flex.
DO NOT MODIFY IT, as your changes will be overritten the next
time you run flex.
The *.d files are automatically generated Makefiles that capture
dependencies between source and header files in this directory.
These files are updated automatically by Makefile; see the gmake
documentation for a detailed explanation.
To compile your lextest program type:
% make lexer
Run your lexer by putting your test input in a file 'foo.cl' and
run the lextest program:
% ./lexer foo.cl
To run your lexer on the file test.cl type:
% make dotest
If you think your lexical analyzer is correct and behaves like
the one we wrote, you can actually try 'mycoolc' and see whether
it runs and produces correct code for any examples.
If your lexical analyzer behaves in an
unexpected manner, you may get errors anywhere, i.e. during
parsing, during semantic analysis, during code generation or
only when you run the produced code on spim. So beware.
If you change architectures you must issue
% make clean
when you switch from one type of machine to the other.
If at some point you get weird errors from the linker,
you probably forgot this step.
---8<------8<------8<------8<---cut here---8<------8<------8<------8<---
Author Abdullah Emad
The core implementation can be found in "cool.flex". The file starts by defining important regular expressions and naming them. Those regular expressions and their actions will be discussed in details one by one.
REGEX design notes
A case insensitive keyword K is represented as follows: (c0|C0)(c1|C1)....(cn|Cn) where ci is the ith lowercase letter of K and Ci is the ith upper letter of K. true and false are the only exception as they have to start with lower case letters.
DARROW
This represents the "=>" input token.
Action: Since no value associated with it, the action is to simply return the code associated with this token.
ASSIGN
This represents the "<-" input token.
Action: Returns the code associated with this token.
LE
This represents the less than or equal, "<=" input token.
Action: Returns the code associated with this token.
NOT
This represents the NOT input token.
Action: Returns the code associated with this token.
INTEGERS
This represents an integer in the code. An integer is the concatenation of one or more digit from zero to nine. Hence, [0-9]+
Action: Insert the value in the inttable, assign its relevant entry to cool_yyval.symbol and return the code associated with this token.
TBOOLEAN
This represents the boolean value of true. note that the true token must begin with a lower case t while the other letters can be of any form.
Action: set cool_yyval.boolean to true and return a boolean code.
FBOOLEAN
This represents the boolean value of false. note that the false token must begin with a lower case t while the other letters can be of any form.
Action: set cool_yyval.boolean to false and return a boolean code.
TYPEID
This represents a type id. A type ID must begin with an upper case letter concatenated with any letters and underscores.
Action: Add it to the stringtable, set its value and return the code.
OBJECTID
This represents an object id. An Object ID must begin with a lower case letter concatenated with any letters and underscores.
Action: Add it to the stringtable, set its value and return the code.
NEWLINE
A new line character
Action: increment the curr_lineno variable
SPECIALCHARACTER
any tabs, backspaces, etc...
Action: do nothing
STARTCOMMENT
represents a block comment opening "(*".
ACTION: skip all the following input, including any nested comments, until the matching closing *) is encountered, or EOF has been reached. If EOF was reached before seeing a closing tag, an ERROR token is returned.
CLOSECOMMENT
Represents a block comment closure "*)".
ACTION: return an ERROR token, because there is no way for this to match except if this is a dangling "*)" with no comment opening associated with it.
ONELINECOMMENT
Represents a single line comment "--". Matches any character after "--" until a new line is encountered. If no newline is encountered, this means this must be the last line in the file and EOF was reached and hence after trying to match any character concatenated with a new line, it will try to match any character.
ACTION: do nothing
STRING
Represents a cool string (array of characters). The following rules, encoded in the STRING regex, define how a valid cool string should be formated:
- They start and end with quotations
- characters in the string can be any except for \0 (The null terminator ASCII code)
- \n character is only allowed if it follows an escape character \
- \ character is only allowed if it precedes any other allowed character including itself
Any character not satisfying these criteria is invalid.
the associated regex, "(\\n|\[^\0]|[^\0\n\"\\])*", can be explained as follows:
It begins and ends with a quotation mark. In the middle it accepts any number of concatenations of the following characters:
- an escaped newline. i.e a new line preceded with a '' character
- Any possible escaped character except the null terminator. Any character preceded with a '' character.
- Any character That is not one of the following four: a null terminator, a new line, a double quotation, a back slack. Note that any of these characters except the null terminator is valid if it was escaped. Thus, if they don't match the first two cases, then it must be the case that they are not escaped and they should not match this case.
The following sections cover the possible format errors of a string.
Action: if the string is matched successfully with this regular expression, the next step is escaping all the escaped characters and represent them as a single ASCII character. The first thing it does is it replaces any character c preceded with a backslash with the character c, i.e it replace \c with c. The following characters are exceptions: n, b, f, t, 0. If any of these characters are escaped, the will be replaced by \n, \b, \f, \t, \0 where \n denotes a single character (a new line) and \b denotes a single character (a backslash) and so on. It then stores them, character by character, in a buffer of a fixed size. If the size of the string is too large for the buffer, it simply returns an Error token.
NULSTR
This is possibly a valid string that starts with a quotation and ends with a quotation; however, it contains a null character somewhere. Hence, it should report this as an error.
Action: return an ERROR token.
ESCAPEDNULSTR
This is possibly a valid string that starts with a quotation and ends with a quotation; however, it contains an escaped null character somewhere. Hence, it should report this as an error.
Action: return an ERROR token.
UNMATCHEDSTR
The construction of this regular expression depends on the fact that any valid string is matched with the STRING regex, otherwise, this won't work correctly. Because an invalid string, is one that contains a null character, an escaped null character or missing a closing quotation, the string will match with this regex only if it does not match with the above Regex given that the lexical analyzer always tries to match the largest string.
Action: return an ERROR token.
Keywords
Because the keywords have almost the same definition, they are all explained and listed in this section. A keyword is case insensitive that is represented as discussed in the design notes section.
- TBOOLEAN
- FBOOLEAN
- CLASS
- ELSE
- FI
- IF
- IN
- INHERITS
- LET
- LOOP
- POOL
- THEN
- WHILE
- CASE
- ESAC
- OF
- NEW
- ISVOID
Action: all of them simply returns the code of the keyword.
Single characters
All single characters are represented in this section
- {
- }
- (
- )
- :
- ;
- .
- ,
- @
-
-
-
- /
- ~
- <
- =
Action: simply return the ASCII code of each character
Testing the Lexical Analyzer
For testing, I run diff to compute the difference between running my lexer on a test file and running the sample lexer on the same text file. If the output of diff is empty, then my lexer passes the test. Otherwise, there is an error. To make things easier, I wrote a script that runs the diff command on both lexers and compares the output of diff. By default, the make file runs the diff command on the file "mytest.cl". The file can be changed by changing the value of the variable MYTEST
To run the test on "mytest.cl"
make mytest
To run the test on any different cool file (path/to/file.cl)
make mytest MYTEST=path/to/file.cl
To run the tests provided by the online class run:
make test
This command was run on all the sample programs provided in examples and it passed. Additionally, the mytest.cl is a comprehensive set of valid and invalid strings including user strings, integers, type identifiers, object identifiers, case insensitive keywords, comments, and white spaces. All the special cases I could think of. Additionally, it tests for invalid programs, such as:
- A string with no matching closing quotation, possibly because it contains a non-escaped new line character
- A string with a null terminator character in between.
- A string with an escaped null terminator character
- An EOF in the middle of the string
- An unmatched comment block
- A nested comment block with its nested comments unmatched
- A comment that has EOF
- Special Characters That are not part of the cool program