Assignment 2
------------

Assigned: Tuesday, January 17
Due: Tuesday, January 31

PROGRAMMING LANGUAGES

A programming language is a set of strings. A (formal) parser is a tool that takes in a string and returns TRUE or FALSE depending on whether the string is in the language or not. The effort that the parser has to go through is an indication of the complexity of the language. We will be concerned with two classes of formal languages: the regular languages and the context-free languages.

REGULAR LANGUAGES AND LEXICAL ANALYSIS

A regular language is one for which a finite state machine can serve as a parser. Parsers based on finite state machines are usually called lexical analyzers, or lexers. Lexers can process their input in time proportional to the length of the input, O(n).

CONTEXT FREE LANGUAGES AND SYNTACTIC ANALYSIS

Lexical analyzers are not powerful enough to handle all of the constructs typically found in a programming language. These features normally place the language in the class of context-free languages. Tools for analyzing context-free languages are typically called syntactic analyzers (or just parsers). In the most general case, an analyzer for a context-free language can be built using a finite state machine together with a stack. These machines are called non-deterministic pushdown automata, and they can process their input in time proportional to the cube of the length of the input, O(n**3).

DIVIDING THE WORK

Designing a parser for a programming language means dealing with the following tradeoff: lexical analyzers are fast but weak; syntactic analyzers are slow but more powerful. The tradeoff that has emerged over the last forty years has two aspects.

+ Lexical analyzers are responsible for dealing with "tokens" (the words and punctuation of the programming language), and syntactic analyzers are responsible for dealing with "phrases" (the legal ways in which tokens can be combined).
Typical classes of tokens include comments, whitespace, identifiers, literal constants (e.g., numbers and strings), operators, punctuation, and keywords. Syntactic analyzers deal with higher-level constructs such as expressions, statements, declarations, and programs.

+ The programming languages themselves are normally restricted to a deterministic subset of the full context-free languages. Examples of useful subsets are the LL, LR, SLR, and LALR subsets. Although some expressive power is given up when using one of these subsets, the advantage is that parsing can be accomplished in linear time, O(n), with no complex backtracking required.

OTHER TASKS

Getting a TRUE/FALSE answer from a parser is not actually all that useful. There are other tasks that parsers need to perform:

+ Error detection - telling the programmer where in the program (input string) the parser went astray.

+ Error correction - allowing the parser to go on and try to parse more of an erroneous program, so that the programmer can learn more from a single run of the parser. This is a difficult problem, and no exact solution exists.

+ Lexical evaluation - normally, lexical analyzers not only determine that a literal constant obeys the rules of the programming language, but also process the constant to produce a value that can be used at runtime. For example, integer constants are converted to binary integers, real constants are converted to binary floating-point or fixed-point numbers, and escape sequences in strings are replaced by the actual ASCII character codes. For identifiers, a symbol table entry is constructed so that attributes of the corresponding program variable can be assembled. For punctuation and keywords, an indication of the token type is returned. Comments and whitespace are normally not passed on by the lexical analyzer.
+ Parse tree construction - many parsers not only perform the tasks listed above but also construct a data structure usable by the other components of a compiler. For lexical analysis, this data structure is normally a stream: each time the analyzer is called, it returns the "next" token in the input along with its processed value (binary equivalent or symbol table entry). For syntactic analyzers, the data structure often takes the form of a "parse tree" (a tree whose branching matches the structure of the programming language but which has been instantiated with the actual parsed contents of the input program).

PARSING TOOLS

By now, you should be able to guess that parsing is a well-understood task. In fact, parsers are usually no longer programmed by hand. Instead, "analyzer generator" tools are used to construct them. For example, the "lex" tool is a lexical-analyzer generator, and "yacc" is a syntactic-analyzer generator. You give these tools a description of a programming language, and they automatically construct a lexer or a parser, respectively. Just as general lexical and syntactic analyzers do more than parse, analyzer generator tools also allow their users to describe how and when to perform other activities, such as those described above. They normally do this by providing a way for their users to attach instructions written in a programming language (C, for lex and yacc).

GRAMMARS

Of course, the question then arises: how do you describe the programming language to be parsed? A grammar is a description of a formal language. Just as languages have classes, so too do grammars. For example, a context-free grammar describes a context-free language. Over the last forty years, the computer science community has converged on two decisions as to the form that grammars should take in describing programming languages:

+ Lexical analyzers will be described using regular expressions.
+ Syntactic analyzers will be described using some form of BNF.

REGULAR EXPRESSIONS

Regular expressions are formed from constants and three operations: concatenation, alternation, and iteration. Practical lexical analyzer generators, like lex, add other features to the notation. Please see the lex manual for details.

BNF

The syntax of programming languages is normally described using extended BNF (EBNF). "BNF" stands for either "Backus Normal Form" or "Backus-Naur Form". John Backus originally described the notation, and Peter Naur adapted it for use in the Algol-60 report. Both scientists were members of the Algol committee. The notation has since been extended to make it more useful for describing context-free languages. BNF consists of two categories of symbols: terminals and non-terminals. Terminals denote the actual tokens of the language being described; non-terminals denote grammatical categories. Like regular expressions, BNF features concatenation, alternation, and iteration.

--------------------------------------------------------------------------------

ASSIGNMENT

Your assignment is to build a lexical analyzer and a syntactic analyzer for EBNF. You should use generator tools such as lex and yacc. (Many others exist, such as ANTLR and JavaCC.) You are welcome to choose your own weapon, but lex and yacc are the only tools that I can guarantee are up to date, have adequate documentation, and have been installed correctly. I believe that ANTLR combines lexical and syntactic analysis in one tool, so please adapt the following requirements accordingly.

PART I - Two points

The first step asks you to construct a grammar for EBNF and to describe its tokens. Turn in your grammar and token description.

PART II - Three points

Compose lexical rules for EBNF and build a lexical analyzer for it. The analyzer should provide a stream of tokens to the syntactic analyzer. Turn in your lex input file (.l file).
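To make the idea of a token stream concrete, here is a small sketch, written in Python rather than lex, of the behavior PART II asks for: each call yields the "next" token along with its lexeme, and whitespace is consumed but never passed on. The token names and the particular character patterns chosen here are illustrative assumptions only; your own lex rules must come from the token description you produce in PART I, not from this sketch.

```python
import re

# Illustrative token classes for an EBNF-like notation. These names and
# patterns are assumptions for this sketch, not a required vocabulary.
TOKEN_SPEC = [
    ("WHITESPACE",  r"\s+"),                    # consumed, never passed on
    ("DEFINE",      r"::="),                    # rule-definition symbol
    ("ALTERNATE",   r"\|"),                     # alternation bar
    ("LBRACE",      r"\{"),                     # EBNF iteration, open
    ("RBRACE",      r"\}"),                     # EBNF iteration, close
    ("LBRACKET",    r"\["),                     # EBNF option, open
    ("RBRACKET",    r"\]"),                     # EBNF option, close
    ("NONTERMINAL", r"<[A-Za-z][A-Za-z0-9-]*>"),
    ("TERMINAL",    r"[A-Za-z0-9]+"),
]

# One master pattern; the name of the matching group identifies the token.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokens(text):
    """Yield (token_type, lexeme) pairs; raise on input not in the language."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            # Error detection: report where in the input the lexer went astray.
            raise SyntaxError(f"lexical error at position {pos}")
        pos = m.end()
        if m.lastgroup != "WHITESPACE":
            yield (m.lastgroup, m.group())

# One EBNF-style rule; the first two tokens are
# ("NONTERMINAL", "<digits>") and ("DEFINE", "::=").
print(list(tokens("<digits> ::= <digit> { <digit> }")))
```

A real lex-generated analyzer works along the same lines, only as a generated finite state machine rather than a regex loop, and it hands each token to the yacc-generated parser on demand instead of printing it.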
PART III - Three points

Compose grammatical rules for EBNF and build a syntactic analyzer for it. The analyzer should call the lexical analyzer to obtain tokens. It should indicate whether the parse succeeded or failed. If it succeeded, it should print out a trace of the reductions (syntactic matches) it performed. If it failed, it should indicate where in the input the failure occurred. Note that you are not required to construct a parse tree. Turn in your yacc input file (.y file).

PART IV - Two points

Compose a series of tests for your analyzers. Test each tool separately and together. Your tests should be comprehensive enough to demonstrate the power of your tools and their ability to deal with erroneous input. One of your tests should parse a grammar for BNF; another should parse a grammar for EBNF. Turn in both the input (the tests themselves) and the output (the results of running your tests).

HINT (but please do not use)

This assignment originates from Georgia Tech, and I urge you to work on it yourself and solve it on your own.