README file for Programming Assignment 2 (C++ edition)

Your directory should contain the following files:

    Makefile
    README
    cool.flex
    test.cl
    lextest.cc      -> [cool root]/src/PA2/lextest.cc
    mycoolc         -> [cool root]/PA2/mycoolc
    stringtab.cc    -> [cool root]/PA2/stringtab.cc
    utilities.cc    -> [cool root]/PA2/utilities.cc
    handle_flags.cc -> [cool root]/PA2/handle_flags.cc
    *.d             dependency files
    .               other generated files

The include (.h) files for this assignment can be found in [cool root]/lexer

The Makefile contains targets for compiling and running your
program. DO NOT MODIFY.

The README contains this info. Part of the assignment is to fill
the README with the write-up for your project. You should
explain design decisions, explain why your code is correct, and
why your test cases are adequate. It is part of the assignment
to clearly and concisely explain things in text as well as to
comment your code. Just edit this file.

cool.flex is a skeleton file for the specification of the
lexical analyzer. You should complete it with your regular
expressions, patterns and actions. 

test.cl is a COOL program that you can test the lexical
analyzer on. It contains some errors, so it won't compile with
coolc. However, test.cl does not exercise all lexical
constructs of COOL and part of your assignment is to rewrite
test.cl with a complete set of tests for your lexical analyzer.

cool-parse.h contains definitions that are used by almost all parts
of the compiler. DO NOT MODIFY.

stringtab.{cc|h} and stringtab_functions.h contain functions
to manipulate the string tables. DO NOT MODIFY.

utilities.{cc|h} contains functions used by the main() part of
the lextest program. You may want to use the strdup() function
defined in here. Remember that you should not print anything
from inside cool.flex! DO NOT MODIFY.

lextest.cc contains the main function which will call your
lexer and print out the tokens that it returns.  DO NOT MODIFY.

mycoolc is a shell script that glues together the phases of the
compiler using Unix pipes instead of statically linking code.  
While inefficient, this architecture makes it easy to mix and match
the components you write with those of the course compiler.
DO NOT MODIFY.	

cool-lexer.cc is the scanner generated by flex from cool.flex.
DO NOT MODIFY IT, as your changes will be overwritten the next
time you run flex.

The *.d files are automatically generated Makefiles that capture
dependencies between source and header files in this directory.
These files are updated automatically by Makefile; see the gmake
documentation for a detailed explanation.

Instructions

To compile your lextest program type:

% make lexer

Run your lexer by putting your test input in a file 'foo.cl' and
running the lextest program:

% ./lexer foo.cl

To run your lexer on the file test.cl type:

% make dotest

If you think your lexical analyzer is correct and behaves like
the one we wrote, you can actually try 'mycoolc' and see whether
it runs and produces correct code for any examples.
If your lexical analyzer behaves in an
unexpected manner, you may get errors anywhere, e.g. during
parsing, during semantic analysis, during code generation, or
only when you run the produced code on spim. So beware.

If you change architectures you must issue

% make clean

when you switch from one type of machine to the other.
If at some point you get weird errors from the linker,	
you probably forgot this step.

---8<------8<------8<------8<---cut here---8<------8<------8<------8<---

Author: Abdullah Emad

Lexer

The core implementation can be found in "cool.flex". The file starts by defining the important regular expressions and naming them. Those regular expressions and their actions are discussed in detail, one by one, below.

REGEX design notes

A case-insensitive keyword K is represented as follows: (c0|C0)(c1|C1)...(cn|Cn), where ci is the ith lowercase letter of K and Ci is the ith uppercase letter of K. true and false are the only exceptions, as they have to start with a lowercase letter.
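For illustration, a minimal sketch of what such named definitions could look like in the definitions section of cool.flex (the exact spelling of the names in the file may differ):

    CLASS      (c|C)(l|L)(a|A)(s|S)(s|S)
    INHERITS   (i|I)(n|N)(h|H)(e|E)(r|R)(i|I)(t|T)(s|S)
    TBOOLEAN   t(r|R)(u|U)(e|E)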

DARROW

This represents the "=>" input token.

Action: Since no value is associated with this token, the action simply returns the code associated with it.

ASSIGN

This represents the "<-" input token.

Action: Returns the code associated with this token.

LE

This represents the less than or equal, "<=" input token.

Action: Returns the code associated with this token.

NOT

This represents the NOT input token.

Action: Returns the code associated with this token.
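Since DARROW, ASSIGN, and LE carry no value, their rules reduce to returning the token codes defined in cool-parse.h. A minimal sketch using the named definitions (NOT is handled the same way, with a case-insensitive pattern):

    {DARROW}   { return (DARROW); }   /* "=>" */
    {ASSIGN}   { return (ASSIGN); }   /* "<-" */
    {LE}       { return (LE); }       /* "<=" */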

INTEGERS

This represents an integer in the code. An integer is the concatenation of one or more digits from zero to nine. Hence, [0-9]+

Action: Insert the value in the inttable, assign the resulting entry to cool_yylval.symbol, and return the code associated with this token.
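A minimal sketch of that action, using the named definition above and assuming the inttable and the INT_CONST token code provided by the PA2 skeleton:

    {INTEGERS}  {
        cool_yylval.symbol = inttable.add_string(yytext);   /* store the lexeme in the int table */
        return (INT_CONST);
    }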

TBOOLEAN

This represents the boolean value of true. Note that the true token must begin with a lowercase 't' while the other letters can be of either case.

Action: set cool_yylval.boolean to true and return the boolean code.

FBOOLEAN

This represents the boolean value of false. Note that the false token must begin with a lowercase 'f' while the other letters can be of either case.

Action: set cool_yylval.boolean to false and return the boolean code.
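A sketch of the two boolean rules, assuming the BOOL_CONST token code defined in cool-parse.h:

    t(r|R)(u|U)(e|E)        { cool_yylval.boolean = true;  return (BOOL_CONST); }
    f(a|A)(l|L)(s|S)(e|E)   { cool_yylval.boolean = false; return (BOOL_CONST); }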

TYPEID

This represents a type ID. A type ID must begin with an uppercase letter, followed by any combination of letters and underscores.

Action: Add it to the string table, set cool_yylval.symbol to the resulting entry, and return the code.

OBJECTID

This represents an object ID. An object ID must begin with a lowercase letter, followed by any combination of letters and underscores.

Action: Add it to the string table, set cool_yylval.symbol to the resulting entry, and return the code.
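A sketch of the two identifier rules, assuming the idtable and the TYPEID/OBJECTID token codes provided by the PA2 skeleton:

    {TYPEID}    { cool_yylval.symbol = idtable.add_string(yytext); return (TYPEID); }
    {OBJECTID}  { cool_yylval.symbol = idtable.add_string(yytext); return (OBJECTID); }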

NEWLINE

A new line character

Action: increment the curr_lineno variable

SPECIALCHARACTER

Any tabs, backspaces, and other whitespace characters.

Action: do nothing

STARTCOMMENT

Represents a block comment opening "(*".

Action: skip all the following input, including any nested comments, until the matching closing "*)" is encountered or EOF is reached. If EOF is reached before a closing "*)" is seen, an ERROR token is returned.
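One possible way to implement this behaviour is with a flex start condition and a nesting counter. This is only a sketch under that assumption: comment_depth is a hypothetical helper and the error message is assumed; the actual cool.flex may instead consume the whole comment inside a single action.

    %x COMMENT
    %{
        static int comment_depth = 0;   /* nesting level of "(*" ... "*)"; assumed helper */
    %}
    %%
    "(*"               { comment_depth = 1; BEGIN(COMMENT); }
    <COMMENT>"(*"      { comment_depth++; }
    <COMMENT>"*)"      { if (--comment_depth == 0) BEGIN(INITIAL); }
    <COMMENT>\n        { curr_lineno++; }
    <COMMENT>.         { /* discard everything else inside the comment */ }
    <COMMENT><<EOF>>   {
                           BEGIN(INITIAL);                             /* so the next call sees a normal EOF */
                           cool_yylval.error_msg = "EOF in comment";   /* assumed message */
                           return (ERROR);
                       }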

CLOSECOMMENT

Represents a block comment closure "*)".

Action: return an ERROR token, because this pattern can only match a dangling "*)" with no comment opening associated with it.

ONELINECOMMENT

Represents a single-line comment "--". It matches every character after "--" until a newline is encountered. If no newline is encountered, the comment must be on the last line of the file and EOF was reached; hence, after trying to match the comment text followed by a newline, the regex falls back to matching the comment text alone.

Action: do nothing.
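A simpler, roughly equivalent formulation is possible because '.' in flex never matches a newline, so a single rule covers both the end-of-line and the EOF case (a sketch only; the actual cool.flex may keep the two-alternative regex described above):

    --.*    { /* discard the comment text; a trailing newline, if any, is matched by the NEWLINE rule */ }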

STRING

Represents a Cool string (an array of characters). The following rules, encoded in the STRING regex, define how a valid Cool string should be formatted:

  1. It starts and ends with a double quotation mark
  2. Characters in the string can be anything except \0 (the null character)
  3. A newline character is only allowed if it follows an escape character \
  4. The \ character is only allowed if it precedes another allowed character, including itself

Any character not satisfying these criteria is invalid.

The associated regex, \"(\\\n|\\[^\0]|[^\0\n\"\\])*\", can be explained as follows:

It begins and ends with a quotation mark. In the middle it accepts any number of concatenations of the following characters:

  1. An escaped newline, i.e. a newline preceded by a '\' character
  2. Any possible escaped character except the null character, i.e. any character preceded by a '\' character
  3. Any character that is not one of the following four: a null character, a newline, a double quotation mark, a backslash. Note that any of these characters except the null character is valid if it is escaped. Thus, if they do not match the first two cases, it must be the case that they are not escaped, and they should not match this case.

The following sections cover the possible format errors of a string.

Action: if the string is matched successfully by this regular expression, the next step is to translate every escape sequence into the single ASCII character it represents. First, any character c preceded by a backslash is replaced with the character c itself, i.e. \c is replaced with c. The characters n, b, f, t, and 0 are exceptions: when escaped they are replaced by \n, \b, \f, \t, and \0, where \n denotes a single character (a newline), \b denotes a single character (a backspace), and so on. The resulting characters are then stored, one by one, in a buffer of fixed size. If the string is too large for the buffer, an ERROR token is returned.
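A simplified sketch of that action, assuming the string_buf buffer, the MAX_STR_CONST limit, the stringtable, and the STR_CONST/ERROR token codes provided by the PA2 skeleton (the error message text is an assumption, and the real action in cool.flex may differ in detail):

    {STRING}  {
        char *p   = string_buf;                    /* fixed-size buffer from the skeleton */
        char *end = yytext + yyleng - 1;           /* position of the closing quote */
        for (char *s = yytext + 1; s < end; s++) { /* skip the opening quote */
            if (p - string_buf >= MAX_STR_CONST - 1) {
                cool_yylval.error_msg = "String constant too long";   /* assumed message */
                return (ERROR);
            }
            if (*s == '\\') {                      /* translate the escape sequence */
                s++;
                switch (*s) {
                    case 'n':  *p++ = '\n'; break;
                    case 'b':  *p++ = '\b'; break;
                    case 'f':  *p++ = '\f'; break;
                    case 't':  *p++ = '\t'; break;
                    case '\n': curr_lineno++; *p++ = '\n'; break;     /* escaped newline */
                    default:   *p++ = *s;   break; /* \c becomes c */
                }
            } else {
                *p++ = *s;
            }
        }
        *p = '\0';
        cool_yylval.symbol = stringtable.add_string(string_buf);
        return (STR_CONST);
    }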

NULSTR

This is a string that starts and ends with a quotation mark and would otherwise be valid; however, it contains a null character somewhere. Hence, it is reported as an error.

Action: return an ERROR token.

ESCAPEDNULSTR

This is a string that starts and ends with a quotation mark and would otherwise be valid; however, it contains an escaped null character somewhere. Hence, it is reported as an error.

Action: return an ERROR token.

UNMATCHEDSTR

The construction of this regular expression relies on the fact that any valid string is matched by the STRING regex; otherwise, it will not work correctly. An invalid string is one that contains a null character, contains an escaped null character, or is missing its closing quotation. Since the lexical analyzer always tries to match the longest string, a string matches this regex only if it does not match the regexes above.

Action: return an ERROR token.
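A sketch of what these three error rules might look like, reusing the named definitions above; the exact error message strings are assumptions:

    {NULSTR}         { cool_yylval.error_msg = "String contains null character";          return (ERROR); }
    {ESCAPEDNULSTR}  { cool_yylval.error_msg = "String contains escaped null character";  return (ERROR); }
    {UNMATCHEDSTR}   { cool_yylval.error_msg = "Unterminated string constant";            return (ERROR); }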

Keywords

Because the keywords all have almost the same definition, they are explained and listed together in this section. Each keyword is case insensitive and is represented as discussed in the design notes section above.

  1. TBOOLEAN
  2. FBOOLEAN
  3. CLASS
  4. ELSE
  5. FI
  6. IF
  7. IN
  8. INHERITS
  9. LET
  10. LOOP
  11. POOL
  12. THEN
  13. WHILE
  14. CASE
  15. ESAC
  16. OF
  17. NEW
  18. ISVOID

Action: each of them simply returns the code of the keyword.

Single characters

All single-character tokens are listed in this section:

  1. {
  2. }
  3. (
  4. )
  5. :
  6. ;
  7. .
  8. ,
  9. @
  10. /
  11. ~
  12. <
  13. =

Action: simply return the ASCII code of each character.
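A sketch of a few of these rules; the returned token code is just the character's own ASCII value:

    "{"   { return '{'; }
    "}"   { return '}'; }
    ";"   { return ';'; }
    "<"   { return '<'; }
    "="   { return '='; }
    /* ...and likewise for the remaining single-character tokens */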

Testing the Lexical Analyzer

For testing, I run diff to compute the difference between the output of my lexer on a test file and the output of the sample lexer on the same file. If the output of diff is empty, my lexer passes the test; otherwise, there is an error. To make this easier, I wrote a script that runs both lexers and diffs their output. By default, the Makefile runs this comparison on the file "mytest.cl"; the file can be changed by setting the variable MYTEST.

To run the test on "mytest.cl"

make mytest

To run the test on any different cool file (path/to/file.cl)

make mytest MYTEST=path/to/file.cl

To run the tests provided by the online class run:

make test

This command was run on all the sample programs provided in the examples directory, and the lexer passed. Additionally, mytest.cl contains a comprehensive set of valid and invalid inputs, including strings, integers, type identifiers, object identifiers, case-insensitive keywords, comments, and whitespace: all the special cases I could think of. It also tests invalid programs, such as:

  1. A string with no matching closing quotation, possibly because it contains a non-escaped newline character
  2. A string with a null character in the middle
  3. A string with an escaped null character
  4. An EOF in the middle of a string
  5. An unmatched comment block
  6. A nested comment block with its nested comments unmatched
  7. A comment containing an EOF
  8. Special characters that are not part of the Cool language