
Project 1: Lex & Yacc - Compiler Tools

Assignment Day January 25, 2012
Due Date February 7, 2012 (about 2 weeks for both parts)

Collaboration Policy - Read Carefully

You must work on this project individually, but you may discuss this assignment with other students in the class and ask for and provide help in useful ways, preferably over our email list so we can all benefit from your great ideas. You may consult (but not copy) any outside resources, including books, papers, web sites, and people.

If you use resources other than the class materials, indicate what you used along with your answer.

Objective:

The main objective of this assignment is, of course, for you to familiarize yourself with the compiler tools Lex and Yacc (or Flex and Bison). Here are some hints on getting familiar with the tools.

Key Concepts:

  • Lex & Yacc Programming
  • Token Definitions
  • Grammar Generation and Recognition

Tutorial / References

Lex and Yacc Resources:

  • Compact Guide to Lex & Yacc

    http://epaperpress.com/lexandyacc/index.html

Lex and Yacc run on nike.cs.uga.edu and most UNIX machines.

Background: Overview of Lexical and Parsing Tools

A programming language is a set of strings. A (formal) parser is a tool that takes in a string and returns TRUE or FALSE depending on whether the string is in the language. The effort the parser has to go through is an indication of the complexity of the language. We will be concerned with two classes of formal languages: the regular languages and the context free languages.

Regular Languages and Lexical Analysis

A regular language is one for which a finite state machine can serve as a parser. Parsers based on finite state machines are usually called lexical analyzers, or lexers. Lexers can process their input in time proportional to the length of the input, O(n).
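
For example, a tiny flex specification along these lines (a sketch; the token names are illustrative) recognizes numbers and identifiers in a single left-to-right pass:

%%
[0-9]+                  { printf("NUMBER: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*  { printf("IDENT: %s\n", yytext); }
[ \t\n]+                { /* skip whitespace */ }
.                       { printf("OTHER: %s\n", yytext); }
%%

Compile with flex scan.l && cc lex.yy.c -lfl (the -lfl library supplies a default main).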

Context Free Languages and Syntactic Analysis

Lexical analyzers are not powerful enough to handle all of the constructs typically found in a programming language; features such as arbitrarily nested constructs place the language in the class of context free languages. Tools for analyzing context free languages are typically called syntactic analyzers (or just parsers). In the most general case, an analyzer for a context free language can be built using a finite state machine together with a stack. These machines are called non-deterministic pushdown automata, and they can process their input in time proportional to the cube of the length of the input, O(n^3).
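
For example, matched pairs such as nested <b> ... </b> tags are context free but not regular; describing them requires a recursive rule (a yacc-style sketch using the BF_START and BF_END tokens defined later in this document):

bold : BF_START bold BF_END    /* recursion forces the pairs to balance */
     |                         /* empty */
     ;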

Dividing The Work

Designing a parser for a programming language means dealing with the following tradeoff: lexical analyzers are fast but weak; syntactic analyzers are slow but more powerful. The division of labor that has emerged over the last forty years has two aspects.

Lexical analyzers are responsible for dealing with "tokens" (the words and punctuation of the programming language) and syntactic analyzers are responsible for dealing with "phrases" (the legal ways in which tokens can be combined). Typical classes of tokens comprise comments, whitespace, identifiers, literal constants (e.g. numbers and strings), operators, punctuation and keywords. Syntactic analyzers deal with higher-level constructs such as expressions, statements, declarations, and programs.
The programming languages themselves are normally restricted to be a deterministic subset of the full context free languages. Examples of useful subsets are the LL, LR, SLR, and LALR subsets. Although some expressive power is given up when using one of these subsets, the advantage is that when using them, parsing can be accomplished in linear time O(n) and no complex backtracking is required.

Other Tasks

Getting a TRUE/FALSE answer from a parser is not actually all that useful on its own. There are other tasks that parsers typically perform as well, such as building a parse tree or abstract syntax tree, reporting errors, and triggering semantic actions as constructs are recognized.

Parsing Tools

By now, you should be able to guess that parsing is a well-understood task. In fact, parsers are usually no longer programmed by hand; instead, "analyzer generator" tools are used to construct them. For example, the "lex" tool is a lexical-analyzer generator, and "yacc" is a syntactic-analyzer generator. You give these tools a description of a programming language, and they automatically construct a lexer or a parser, respectively. Just as lexical and syntactic analyzers do more than parse, analyzer-generator tools also allow their users to describe how and when to perform other activities such as those described above. They normally do this by letting users attach instructions written in a programming language (C for lex and yacc) to the rules.
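
For example, a lex rule can carry a C action that both returns a token to the parser and performs extra bookkeeping (a sketch; y.tab.h is the token-definition header produced by yacc -d, and the lineno counter is illustrative):

%{
#include "y.tab.h"   /* token codes generated by yacc -d */
int lineno = 1;      /* illustrative bookkeeping in plain C */
%}
%%
"<b>"   { return BF_START; }          /* hand the token to the parser */
\n      { lineno++; return TEXT; }    /* count lines as a side activity */
.       { return TEXT; }
%%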

Grammars

Of course, the question then arises: how do you describe the programming language to be parsed? A grammar is a description of a formal language. Just as languages have classes, so too do grammars. For example, a context-free grammar describes a context-free language. Over the last forty years the computer science community has converged on two decisions as to the form that grammars should take in describing programming languages:

  • Lexical analyzers will be described using regular expressions.
  • Syntactic analyzers will be described using some form of BNF.

Regular Expressions

Regular expressions are formed from constants and three operations: concatenation, alternation, and iteration. Practical lexical-analyzer generators, like lex, add other features to the notation; please see the lex manual for details.

BNF

The syntax of programming languages is normally described using extended BNF (EBNF). "BNF" stands for either "Backus Normal Form" or "Backus-Naur Form." John Backus originally described the notation, and Peter Naur adapted it for use in the Algol-60 report; both scientists were members of the Algol committee. The notation has since been extended to make it more useful for describing context free languages. BNF consists of two categories of symbols: terminals and non-terminals. Terminals denote the actual tokens of the language being described; non-terminals denote grammatical categories. BNF also features concatenation, alternation, and iteration.
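
For example (illustrative notation only), the three regular-expression operations:

ab         concatenation: an a followed by a b
a|b        alternation:   either an a or a b
a*         iteration:     zero or more a's

And a small BNF rule, with terminals PLUS and DIGIT and the nonterminal Expr:

Expr : Expr PLUS DIGIT
     | DIGIT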

Description

This assignment has two parts. The first part involves writing a simple HTML-to-TXT translator that reads an HTML text file from standard input and writes plain text to standard output.

The second part embellishes the parser so that it enforces some simple grammatical rules.

In the first part, your parser discards all HTML tags and comments and writes the remaining text as a readable text document.

You will only need to be concerned with a simplified version of HTML -- but you are encouraged to embellish the project beyond the minimum requirements.

Part 1: Just Lexing:

HTML Tags:

The HTML standard defines a wide variety of tags. Since the goal of this assignment is to learn to use compiler front-end tools, we will approximate this with a much simpler definition:

A tag is a sequence of characters of the form <S>, where S is a character sequence that begins with a non-whitespace printable character and does not contain any ">" characters.

Printable characters are specified via the C library function isprint(); whitespace characters are specified via the C library function isspace(). They correspond to the flex character-class expressions [:print:] and [:space:], respectively (see man flex).

According to this definition, each of the following is a tag:

<b>
<br>
<a href="www.cs.uga.edu/~maria">
</b>
<bgh#i u&)by 168 jh>
<!@#$%~%^&*()_-+=][;"';:,.|>
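
A flex pattern that approximates this definition (a sketch; the class [^ \t\n>] stands in for "printable, not whitespace, and not >") is:

tag    <[^ \t\n>][^>]*>
%%
{tag}  { /* part 1: discard the tag */ }
%%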

HTML Comments

An HTML comment is a sequence of characters of the form

<!--S-->

where S is a sequence of characters that does not contain the string -->. To simplify the project, we will assume that comments are always terminated, i.e., for every open-comment sequence <!-- there is a corresponding close-comment sequence -->. In other words, your translator should enforce the following requirement:

It is an error to have an end-of-file inside a comment.

Comments should be discarded.
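
One common way to implement this in flex is an exclusive start condition together with an <<EOF>> rule (a sketch; the condition name COMMENT is illustrative):

%{
#include <stdlib.h>
%}
%x COMMENT
%%
"<!--"            { BEGIN(COMMENT); }    /* enter comment mode */
<COMMENT>"-->"    { BEGIN(INITIAL); }    /* comment ends; resume normal scanning */
<COMMENT>.        { /* discard comment text */ }
<COMMENT>\n       { /* discard newlines inside comments */ }
<COMMENT><<EOF>>  { fprintf(stderr, "error: end-of-file inside a comment\n");
                    exit(1); }
%%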

Running Your Program

Your executable program will be called myhtml2txt. It will read input from stdin and write its output to stdout. Thus, to translate an HTML file foo.html to a text file bar.txt, invoke your program as

myhtml2txt < foo.html > bar.txt

Part 2: Lexing and Yaccing:

Functionality:

Your tool should have the following functionality. It should read its input from stdin, ensure that the input follows the grammar rules for our subset of HTML, discard all HTML tags and comments, and write the remaining text to stdout. The text output should "simulate" HTML characteristics such as list indentation, line breaks, and paragraphs, as far as these can be simulated with plain text characters. Naturally, you are not expected to simulate font characteristics such as bold and italic.
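
For example, your actions might call small C helpers that track the current list nesting depth and print a bullet for each list item (a sketch; the helper and variable names are hypothetical):

#include <stdio.h>

static int depth = 0;                        /* current list nesting level */

static void list_enter(void) { depth++; }    /* action for UL_START / OL_START */
static void list_leave(void) { depth--; }    /* action for UL_END / OL_END */

static void item_start(void)                 /* action for LI_START */
{
    printf("\n");
    for (int i = 0; i < depth; i++)
        printf("  ");                        /* two spaces of indent per level */
    printf("* ");                            /* bullet marker for the item */
}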

Grammar:

You need to ensure that the input minimally follows the grammar rules listed below. You are also expected to implement additional tags and productions (details later).

Lexical Rules

A tag is a sequence of characters of the form <S>, where S is a sequence of printable characters not beginning with a white space character and not containing any ">" characters. Our grammar recognizes the following tags:

TABLE_START : <table>
TABLE_END : </table>
BF_START : <b>
BF_END : </b>
IT_START : <i>
IT_END : </i>
UL_START : <ul>
UL_END : </ul>
OL_START : <ol>
OL_END : </ol>
LI_START : <li>
LI_END : </li>

Additionally, the token TAG will match any tag that is not one of the tags listed above (or any tags specified explicitly later on in this document), and the token TEXT will match any (single) character that is not within a tag or comment.
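
In flex, these tokens might be declared as literal patterns that return the token codes generated by yacc (a sketch; because flex prefers the longest match and breaks ties in favor of the earlier rule, the literal tags must appear before the generic TAG pattern):

%{
#include "y.tab.h"   /* token codes from yacc -d */
%}
tag    <[^ \t\n>][^>]*>
%%
"<table>"   { return TABLE_START; }
"</table>"  { return TABLE_END; }
"<b>"       { return BF_START; }
"</b>"      { return BF_END; }
"<i>"       { return IT_START; }
"</i>"      { return IT_END; }
"<ul>"      { return UL_START; }
"</ul>"     { return UL_END; }
"<ol>"      { return OL_START; }
"</ol>"     { return OL_END; }
"<li>"      { return LI_START; }
"</li>"     { return LI_END; }
{tag}       { return TAG; }     /* any tag not listed above */
.|\n        { return TEXT; }    /* any other single character */
%%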

Syntax Rules:

Syntax rules are made up of tokens and nonterminals. A token denotes one or more related strings that are matched by the scanner (e.g., "identifier", "integer constant"). A nonterminal denotes a set of strings with similar syntactic structure (e.g., "declaration", "while loop"). In the rules below, tokens are written in ALL CAPS (like TEXT); nonterminals are written in mixed or lower case (like Html and item). The symbol ε denotes the empty sequence.

A syntax rule consists of a left hand side and a right hand side, separated by a colon ":". The left hand side is a nonterminal whose structure is defined by the rule. A right hand side consists of a set of alternatives, separated by "|". Each alternative is a sequence (possibly empty) of tokens and nonterminals.

Html     : item Html
         | ε

item     : TABLE_START Html TABLE_END
         | BF_START Html BF_END
         | IT_START Html IT_END
         | List
         | Other

List     : UL_START ItemList UL_END
         | OL_START ItemList OL_END

ItemList : OneItem ItemList
         | ε

OneItem  : LI_START Html LI_END

Other    : TAG
         | TEXT

Thus, the rule for the nonterminal Html above consists of two alternatives. The first says that one possible structure for Html is something with the structure of item (which is then defined by its own rules), followed by something else which again has the structure of Html; the second says that Html can simply be the empty sequence. (For those of you who have unwound the recursion in your head, this amounts to saying that Html consists of zero or more items.)
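
Translated into yacc, the rules above map almost directly (a sketch; the %token names match the lexical rules, and /* empty */ marks an ε alternative):

%token TABLE_START TABLE_END BF_START BF_END IT_START IT_END
%token UL_START UL_END OL_START OL_END LI_START LI_END TAG TEXT
%%
Html     : item Html
         | /* empty */
         ;
item     : TABLE_START Html TABLE_END
         | BF_START Html BF_END
         | IT_START Html IT_END
         | List
         | Other
         ;
List     : UL_START ItemList UL_END
         | OL_START ItemList OL_END
         ;
ItemList : OneItem ItemList
         | /* empty */
         ;
OneItem  : LI_START Html LI_END
         ;
Other    : TAG
         | TEXT
         ;
%%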

Extension:

You need to extend your lexical and syntax rules to include the tags and rules that are part of a table in HTML; specifically, you need to define rules for <CAPTION>, <TR>, and <TD>. (Any tutorial on HTML tables will cover these tags.) Please implement the extension separately, i.e., as a separate lex and yacc file (so that it is easier to grade).
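
One possible shape for the new productions (a sketch only; the token names CAPTION_START, CAPTION_END, TR_START, TR_END, TD_START, and TD_END are hypothetical names you would define in your extended lexer, and the exact structure is part of your design):

Table   : TABLE_START Caption Rows TABLE_END
Caption : CAPTION_START Html CAPTION_END
        | ε
Rows    : Row Rows
        | ε
Row     : TR_START Cells TR_END
Cells   : Cell Cells
        | ε
Cell    : TD_START Html TD_END

With rules like these, the existing alternative "item : TABLE_START Html TABLE_END" would be replaced by "item : Table".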

Syntax Errors:

Your program will be expected to deal with errors in a ``reasonable'' way. Error messages should be printed to stderr. They should be specific and should contain enough information (at least a line number) to allow the user to locate the problem.
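
flex can maintain the current line number for you (via %option yylineno), and yacc reports syntax errors through the yyerror function; a minimal sketch:

/* in the .l file:  %option yylineno */

/* in the .y file: */
%{
#include <stdio.h>
extern int yylineno;           /* maintained by flex with %option yylineno */

void yyerror(const char *msg)  /* called by the generated parser on a syntax error */
{
    fprintf(stderr, "line %d: %s\n", yylineno, msg);
}
%}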

Running Your Program:


Your executable program will be called myhtml2txt (and myXhtml2txt for the extension). It will read input from stdin and write its output to stdout. Thus, to translate an HTML file foo.html to a text file bar.txt, invoke your program as

myhtml2txt < foo.html > bar.txt

Other Requirements

It must run on nike. You may develop it in your own environment, but as a last step make sure it runs on nike.

Submitting:

You need to name the directory of your source code "project1/". You must include a README.txt file describing how to run and process your program (the command line arguments used to generate your lexer and parser, and to compile and run your program).

Submission Process:

  1. Create a directory project1/
  2. You need to use a Makefile that contains the following targets (a sketch of such a Makefile appears after the submission steps below):
    • clean
      • Executing the command make clean should delete the *.o files, as well as the executables myhtml2txt and myXhtml2txt, from the current directory.
    • myhtml2txt
      • Executing the command make myhtml2txt should create, in the current directory, an executable file myhtml2txt that implements your HTML-to-text translator from scratch, by invoking the appropriate tools (lex/flex and yacc/bison) on the input specifications. This target should create the basic lexer/parser, the one that does not include the table tags.
    • myXhtml2txt
      • Executing the command make myXhtml2txt should create, in the current directory, an executable file myXhtml2txt that implements your HTML-to-text translator from scratch, by invoking the appropriate tools (lex/flex and yacc/bison) on the input specifications. This target should create the extended lexer/parser, the one that implements the rules/tags of a table.
  3. Include example html files that you tested with your program.
  4. Put all the materials needed (all lex/yacc files) in the above directory. You also need to include a README.txt (or README.html) file that specifies the lexical and syntax rules of your html table.
  5. Submit via the 'submit' command (while on nike.cs.uga.edu)

{nike:maria} submit project1 cs4500
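
A Makefile along these lines would provide the required targets (a sketch, assuming the file names listed under "What you need to submit"; use -lfl instead of -ll if your system provides flex, and remember that recipe lines must be indented with a TAB character):

myhtml2txt: myhtml2txt.l myhtml2txt.y
	yacc -d myhtml2txt.y                     # generates y.tab.c and y.tab.h
	lex myhtml2txt.l                         # generates lex.yy.c
	cc -o myhtml2txt y.tab.c lex.yy.c -ll    # -lfl with flex

myXhtml2txt: myXhtml2txt.l myXhtml2txt.y
	yacc -d myXhtml2txt.y
	lex myXhtml2txt.l
	cc -o myXhtml2txt y.tab.c lex.yy.c -ll

clean:
	rm -f *.o y.tab.c y.tab.h lex.yy.c myhtml2txt myXhtml2txt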

What you need to submit:

project1/

Makefile
myhtml2txt.l
myhtml2txt.y

myXhtml2txt.l
myXhtml2txt.y

...            (extra files, if needed; must be listed in README.txt)

README.txt     (how you lex/yacc/compile/run the program)


When grading we will check for these: