Assignment 2
------------

Assigned: Tuesday, January 17
Due: Tuesday, January 31

PROGRAMMING LANGUAGES

A programming language is a set of strings. A (formal) parser is a tool that takes in a string and returns TRUE or FALSE depending on whether the string is in the language or not. The effort that the parser has to go through is an indication of the complexity of the language. We will be concerned with two classes of formal languages: the regular languages and the context-free languages.

REGULAR LANGUAGES AND LEXICAL ANALYSIS

A regular language is one for which a finite state machine can serve as a parser. Parsers based on finite state machines are usually called lexical analyzers, or lexers. Lexers can process their input in time proportional to the length of the input, O(n).

CONTEXT FREE LANGUAGES AND SYNTACTIC ANALYSIS

Lexical analyzers are not powerful enough to handle all of the constructs typically found in a programming language. These features normally place the language in the class of context-free languages. Tools for analyzing context-free languages are typically called syntactic analyzers (or just parsers). In the most general case, an analyzer for a context-free language can be built using a finite state machine together with a stack. These machines are called non-deterministic pushdown automata, and they can process their input in time proportional to the cube of the length of the input, O(n**3).

DIVIDING THE WORK

Designing a parser for a programming language means dealing with the following tradeoff: lexical analyzers are fast but weak; syntactic analyzers are slow but more powerful. The tradeoff that has emerged over the last forty years has two aspects.

+ Lexical analyzers are responsible for dealing with "tokens" (the words and punctuation of the programming language), and syntactic analyzers are responsible for dealing with "phrases" (the legal ways in which tokens can be combined).
Typical classes of tokens include comments, whitespace, identifiers, literal constants (e.g., numbers and strings), operators, punctuation, and keywords. Syntactic analyzers deal with higher-level constructs such as expressions, statements, declarations, and programs.

+ The programming languages themselves are normally restricted to a deterministic subset of the full context-free languages. Examples of useful subsets are the LL, LR, SLR, and LALR subsets. Although some expressive power is given up when using one of these subsets, the advantage is that parsing can be accomplished in linear time, O(n), with no complex backtracking required.

OTHER TASKS

Getting a TRUE/FALSE answer from a parser is not actually all that useful. There are other tasks that parsers need to perform:

+ Error detection - telling the programmer where in the program (input string) the parser went astray.

+ Error correction - allowing the parser to go on and try to parse more of an erroneous program, so that the programmer can learn more from a single run of the parser. This is a difficult problem, and no exact solution exists.

+ Lexical evaluation - normally, lexical analyzers not only determine that a literal constant obeys the rules of the programming language, but also process the constant to produce a value that can be used at runtime. For example, integer constants are converted to binary integers, real constants are converted to binary floating-point or fixed-point numbers, and escape sequences in strings are replaced by the actual ASCII character codes. For identifiers, a symbol table entry is constructed so that attributes of the corresponding program variable can be assembled. For punctuation and keywords, an indication of the token type is returned. Comments and whitespace are normally not passed on by the lexical analyzer.
+ Parse tree construction - many parsers not only perform the tasks listed above but also construct a data structure usable by the other components of a compiler. For lexical analysis, this data structure is normally a stream: each time the analyzer is called, it returns the "next" token in the input along with its processed value (binary equivalent or symbol table entry). For syntactic analyzers, the data structure often takes the form of a "parse tree" (a tree whose branching matches the structure of the programming language but which has been instantiated with the actual parsed contents of the input program).

PARSING TOOLS

By now, you should be able to guess that parsing is a well-understood task. In fact, parsers are usually no longer programmed by hand. Instead, "analyzer generator" tools are used to construct them. For example, the "lex" tool is a lexical-analyzer generator, and "yacc" is a syntactic-analyzer generator. You give these tools a description of a programming language, and they automatically construct a lexer or a parser, respectively. Just as general lexical and syntactic analyzers do more than parse, analyzer generator tools also allow their users to describe how and when to perform other activities, such as those described above. They normally do this by providing a way for their users to attach instructions written in a programming language (C, for lex and yacc).

GRAMMARS

Of course, the question then arises: how do you describe the programming language to be parsed? A grammar is a description of a formal language. Just as languages have classes, so too do grammars. For example, a context-free grammar describes a context-free language. Over the last forty years, the computer science community has converged on two decisions as to the form that grammars should take in describing programming languages:

+ Lexical analyzers will be described using regular expressions.
+ Syntactic analyzers will be described using some form of BNF.

REGULAR EXPRESSIONS

Regular expressions are formed from constants and three operations: concatenation, alternation, and iteration. Practical lexical analyzer generators, like lex, add other features to the notation. Please see the lex manual for details.

BNF

The syntax of programming languages is normally described using extended BNF (EBNF). "BNF" stands for either "Backus Normal Form" or "Backus-Naur Form". John Backus originally described the notation, and Peter Naur adapted it for use in the Algol-60 report. Both scientists were members of the Algol committee. The notation has since been extended to make it more useful for describing context-free languages. BNF consists of two categories of symbols: terminals and non-terminals. Terminals denote the actual tokens of the language being described; non-terminals denote grammatical categories. Like regular expressions, BNF features concatenation, alternation, and iteration.

--------------------------------------------------------------------------------

ASSIGNMENT

Your assignment is to build a lexical analyzer and a syntactic analyzer for EBNF. You should use generator tools such as lex and yacc. (Many others exist, such as ANTLR and JavaCC.) You are welcome to choose your own weapon, but lex and yacc are the only tools that I can guarantee are up to date, have adequate documentation, and have been installed correctly. I believe that ANTLR combines lexical and syntactic analysis in one tool, so please adapt the following requirements accordingly.

PART I - Two points

The first step asks you to construct a grammar for EBNF and to describe its tokens. Turn in your grammar and token description.

PART II - Three points

Compose lexical rules for EBNF and build a lexical analyzer for it. The analyzer should provide a stream of tokens to the syntactic analyzer. Turn in your lex input file (.l file).
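To make the idea of a token stream concrete, here is a small sketch, written in Python rather than lex, of the behavior PART II asks for: each call yields the "next" token along with its lexeme, and whitespace is consumed but never passed on. The token names and the particular character patterns chosen here are illustrative assumptions only; your own lex rules must come from the token description you produce in PART I, not from this sketch.

```python
import re

# Illustrative token classes for an EBNF-like notation. These names and
# patterns are assumptions for this sketch, not a required vocabulary.
TOKEN_SPEC = [
    ("WHITESPACE",  r"\s+"),                    # consumed, never passed on
    ("DEFINE",      r"::="),                    # rule-definition symbol
    ("ALTERNATE",   r"\|"),                     # alternation bar
    ("LBRACE",      r"\{"),                     # EBNF iteration, open
    ("RBRACE",      r"\}"),                     # EBNF iteration, close
    ("LBRACKET",    r"\["),                     # EBNF option, open
    ("RBRACKET",    r"\]"),                     # EBNF option, close
    ("NONTERMINAL", r"<[A-Za-z][A-Za-z0-9-]*>"),
    ("TERMINAL",    r"[A-Za-z0-9]+"),
]

# One master pattern; the name of the matching group identifies the token.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokens(text):
    """Yield (token_type, lexeme) pairs; raise on input not in the language."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            # Error detection: report where in the input the lexer went astray.
            raise SyntaxError(f"lexical error at position {pos}")
        pos = m.end()
        if m.lastgroup != "WHITESPACE":
            yield (m.lastgroup, m.group())

# One EBNF-style rule; the first two tokens are
# ("NONTERMINAL", "<digits>") and ("DEFINE", "::=").
print(list(tokens("<digits> ::= <digit> { <digit> }")))
```

A real lex-generated analyzer works along the same lines, only as a generated finite state machine rather than a regex loop, and it hands each token to the yacc-generated parser on demand instead of printing it.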
PART III - Three points

Compose grammatical rules for EBNF and build a syntactic analyzer for it. The analyzer should call the lexical analyzer to obtain tokens. It should indicate whether the parse succeeded or failed. If it succeeded, it should print out a trace of the reductions (syntactic matches) it performed. If it failed, it should indicate where in the input the failure occurred. Note that you are not required to construct a parse tree. Turn in your yacc input file (.y file).

PART IV - Two points

Compose a series of tests for your analyzers. Test each tool separately and together. Your tests should be comprehensive enough to demonstrate the power of your tools and their ability to deal with erroneous input. One of your tests should parse a grammar for BNF; another should parse a grammar for EBNF. Turn in both the input (the tests themselves) and the output (the results of running your tests).

HINT (but please do not use)

This assignment originates from Georgia Tech, and I urge you to work on it yourself and solve it on your own.