
Project 1: Lex & Yacc - Compiler Tools

Assignment Day January 25, 2012
Due Date February 7, 2012 (about 2 weeks for both parts)

Collaboration Policy - Read Carefully

You must work on this project individually, but you may discuss this assignment with other students in the class and ask for and provide help in useful ways, preferably over our email list so we can all benefit from your great ideas. You may consult (but not copy) any outside resources, including books, papers, web sites, and people.

If you use resources other than the class materials, indicate what you used along with your answer.

Objective:

The main objective of this assignment is, of course, for you to familiarize yourself with the compiler tools Lex and Yacc (or Flex and Bison). Here are some hints on getting familiar with the tools.

Key Concepts:

  • Lex & Yacc Programming
  • Token Definitions
  • Grammar Generation and Recognition

Tutorial / References

Lex and Yacc Resources:

  • Compact Guide to Lex & Yacc

    http://epaperpress.com/lexandyacc/index.html

Lex and Yacc run on nike.cs.uga.edu and most UNIX machines.

Background: Overview of Lexical and Parsing Tools

A programming language is a set of strings. A (formal) parser is a tool that takes in a string and returns TRUE or FALSE depending on whether the string is in the language. The effort the parser has to go through is an indication of the complexity of the language. We will be concerned with two classes of formal languages: the regular languages and the context free languages.

Regular Languages and Lexical Analysis

A regular language is one for which a finite state machine can serve as a parser. Parsers based on finite state machines are usually called lexical analyzers, or lexers. Lexers can process their input in time proportional to the length of the input, O(n).
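
For example, a tiny flex specification along these lines (a sketch; the token names are illustrative) recognizes numbers and identifiers in a single left-to-right pass:

%%
[0-9]+                  { printf("NUMBER: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*  { printf("IDENT: %s\n", yytext); }
[ \t\n]+                { /* skip whitespace */ }
.                       { printf("OTHER: %s\n", yytext); }
%%

Compile with flex scan.l && cc lex.yy.c -lfl (the -lfl library supplies a default main).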

Context Free Languages and Syntactic Analysis

Lexical analyzers are not powerful enough to handle all of the constructs typically found in a programming language; features such as arbitrarily nested constructs place the language in the class of context free languages. Tools for analyzing context free languages are typically called syntactic analyzers (or just parsers). In the most general case, an analyzer for a context free language can be built using a finite state machine together with a stack. These machines are called non-deterministic pushdown automata, and they can process their input in time proportional to the cube of the length of the input, O(n^3).
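
For example, matched pairs such as nested <b> ... </b> tags are context free but not regular; describing them requires a recursive rule (a yacc-style sketch using the BF_START and BF_END tokens defined later in this document):

bold : BF_START bold BF_END    /* recursion forces the pairs to balance */
     |                         /* empty */
     ;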

Dividing The Work

Designing a parser for a programming language means dealing with the following tradeoff: lexical analyzers are fast but weak; syntactic analyzers are slow but more powerful. The division of labor that has emerged over the last forty years has two aspects.

Lexical analyzers are responsible for dealing with "tokens" (the words and punctuation of the programming language) and syntactic analyzers are responsible for dealing with "phrases" (the legal ways in which tokens can be combined). Typical classes of tokens comprise comments, whitespace, identifiers, literal constants (e.g. numbers and strings), operators, punctuation and keywords. Syntactic analyzers deal with higher-level constructs such as expressions, statements, declarations, and programs.
The programming languages themselves are normally restricted to be a deterministic subset of the full context free languages. Examples of useful subsets are the LL, LR, SLR, and LALR subsets. Although some expressive power is given up when using one of these subsets, the advantage is that when using them, parsing can be accomplished in linear time O(n) and no complex backtracking is required.

Other Tasks

Getting a TRUE/FALSE answer from a parser is not actually all that useful on its own. There are other tasks that parsers typically perform as well, such as building a parse tree or abstract syntax tree, reporting errors, and triggering semantic actions as constructs are recognized.

Parsing Tools

By now, you should be able to guess that parsing is a well-understood task. In fact, parsers are usually no longer programmed by hand; instead, "analyzer generator" tools are used to construct them. For example, the "lex" tool is a lexical-analyzer generator, and "yacc" is a syntactic-analyzer generator. You give these tools a description of a programming language, and they automatically construct a lexer or a parser, respectively. Just as lexical and syntactic analyzers do more than parse, analyzer-generator tools also allow their users to describe how and when to perform other activities such as those described above. They normally do this by letting users attach instructions written in a programming language (C for lex and yacc) to the rules.
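
For example, a lex rule can carry a C action that both returns a token to the parser and performs extra bookkeeping (a sketch; y.tab.h is the token-definition header produced by yacc -d, and the lineno counter is illustrative):

%{
#include "y.tab.h"   /* token codes generated by yacc -d */
int lineno = 1;      /* illustrative bookkeeping in plain C */
%}
%%
"<b>"   { return BF_START; }          /* hand the token to the parser */
\n      { lineno++; return TEXT; }    /* count lines as a side activity */
.       { return TEXT; }
%%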

Grammars

Of course, the question then arises: how do you describe the programming language to be parsed? A grammar is a description of a formal language. Just as languages have classes, so too do grammars. For example, a context-free grammar describes a context-free language. Over the last forty years the computer science community has converged on two decisions as to the form that grammars should take in describing programming languages:

  • Lexical analyzers will be described using regular expressions.
  • Syntactic analyzers will be described using some form of BNF.

Regular Expressions

Regular expressions are formed from constants and three operations: concatenation, alternation, and iteration. Practical lexical-analyzer generators, like lex, add other features to the notation; please see the lex manual for details.

BNF

The syntax of programming languages is normally described using extended BNF (EBNF). "BNF" stands for either "Backus Normal Form" or "Backus-Naur Form." John Backus originally described the notation, and Peter Naur adapted it for use in the Algol-60 report; both scientists were members of the Algol committee. The notation has since been extended to make it more useful for describing context free languages. BNF consists of two categories of symbols: terminals and non-terminals. Terminals denote the actual tokens of the language being described; non-terminals denote grammatical categories. BNF also features concatenation, alternation, and iteration.
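
For example (illustrative notation only), the three regular-expression operations:

ab         concatenation: an a followed by a b
a|b        alternation:   either an a or a b
a*         iteration:     zero or more a's

And a small BNF rule, with terminals PLUS and DIGIT and the nonterminal Expr:

Expr : Expr PLUS DIGIT
     | DIGIT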

Description

This assignment has two parts. The first part involves writing a simple HTML-to-TXT translator that reads an HTML text file from standard input and writes plain text to standard output.

The second part embellishes the parser so that it enforces some simple grammatical rules.

In the first part, your parser discards all HTML tags and comments and writes the remaining text as a readable text document.

You will only need to be concerned with a simplified version of HTML -- but you are encouraged to embellish the project beyond the minimum requirements.

Part 1: Just Lexing:

HTML Tags:

The HTML standard defines a wide variety of tags. Since the goal of this assignment is to learn to use compiler front-end tools, we will approximate this with a much simpler definition:

A tag is a sequence of characters of the form <S>, where S is a character sequence that begins with a non-whitespace printable character and does not contain any ">" characters.

Printable characters are specified via the C library function isprint(); whitespace characters are specified via the C library function isspace(). They correspond to the flex character-class expressions [:print:] and [:space:], respectively (see man flex).

According to this definition, each of the following is a tag:

<b>
<br>
<a href="www.cs.uga.edu/~maria">
</b>
<bgh#i u&)by 168 jh>
<!@#$%~%^&*()_-+=][;"';:,.|>
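
A flex pattern that approximates this definition (a sketch; the class [^ \t\n>] stands in for "printable, not whitespace, and not >") is:

tag    <[^ \t\n>][^>]*>
%%
{tag}  { /* part 1: discard the tag */ }
%%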

HTML Comments

An HTML comment is a sequence of characters of the form

<!--S-->

where S is a sequence of characters that does not contain the string -->. To simplify the project, we will assume that comments are always terminated, i.e., for every open-comment sequence <!-- there is a corresponding close-comment sequence -->. In other words, your translator should enforce the following requirement:

It is an error to have an end-of-file inside a comment.

Comments should be discarded.
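
One common way to implement this in flex is an exclusive start condition together with an <<EOF>> rule (a sketch; the condition name COMMENT is illustrative):

%{
#include <stdlib.h>
%}
%x COMMENT
%%
"<!--"            { BEGIN(COMMENT); }    /* enter comment mode */
<COMMENT>"-->"    { BEGIN(INITIAL); }    /* comment ends; resume normal scanning */
<COMMENT>.        { /* discard comment text */ }
<COMMENT>\n       { /* discard newlines inside comments */ }
<COMMENT><<EOF>>  { fprintf(stderr, "error: end-of-file inside a comment\n");
                    exit(1); }
%%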

Running Your Program

Your executable program will be called myhtml2txt. It will read input from stdin and write its output to stdout. Thus, to translate an HTML file foo.html to a text file bar.txt, invoke your program as

myhtml2txt < foo.html > bar.txt

Part 2: Lexing and Yaccing:

Functionality:

Your tool should have the following functionality. It should read its input from stdin, ensure that the input follows the grammar rules for our subset of HTML, discard all HTML tags and comments, and write the remaining text to stdout. The text output should "simulate" HTML characteristics such as list indentation, line breaks, and paragraphs, as far as these can be simulated with plain text characters. Naturally, you are not expected to simulate font characteristics such as bold and italic.
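
For example, your actions might call small C helpers that track the current list nesting depth and print a bullet for each list item (a sketch; the helper and variable names are hypothetical):

#include <stdio.h>

static int depth = 0;                        /* current list nesting level */

static void list_enter(void) { depth++; }    /* action for UL_START / OL_START */
static void list_leave(void) { depth--; }    /* action for UL_END / OL_END */

static void item_start(void)                 /* action for LI_START */
{
    printf("\n");
    for (int i = 0; i < depth; i++)
        printf("  ");                        /* two spaces of indent per level */
    printf("* ");                            /* bullet marker for the item */
}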

Grammar:

You need to ensure that the input minimally follows the grammar rules listed below. You are also expected to implement additional tags and productions (details later).

Lexical Rules

A tag is a sequence of characters of the form <S>, where S is a sequence of printable characters not beginning with a white space character and not containing any ">" characters. Our grammar recognizes the following tags:

TABLE_START : <table>
TABLE_END : </table>
BF_START : <b>
BF_END : </b>
IT_START : <i>
IT_END : </i>
UL_START : <ul>
UL_END : </ul>
OL_START : <ol>
OL_END : </ol>
LI_START : <li>
LI_END : </li>

Additionally, the token TAG will match any tag that is not one of the tags listed above (or any tags specified explicitly later on in this document), and the token TEXT will match any (single) character that is not within a tag or comment.
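
In flex, these tokens might be declared as literal patterns that return the token codes generated by yacc (a sketch; because flex prefers the longest match and breaks ties in favor of the earlier rule, the literal tags must appear before the generic TAG pattern):

%{
#include "y.tab.h"   /* token codes from yacc -d */
%}
tag    <[^ \t\n>][^>]*>
%%
"<table>"   { return TABLE_START; }
"</table>"  { return TABLE_END; }
"<b>"       { return BF_START; }
"</b>"      { return BF_END; }
"<i>"       { return IT_START; }
"</i>"      { return IT_END; }
"<ul>"      { return UL_START; }
"</ul>"     { return UL_END; }
"<ol>"      { return OL_START; }
"</ol>"     { return OL_END; }
"<li>"      { return LI_START; }
"</li>"     { return LI_END; }
{tag}       { return TAG; }     /* any tag not listed above */
.|\n        { return TEXT; }    /* any other single character */
%%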

Syntax Rules:

Syntax rules are made up of tokens and nonterminals. A token denotes one or more related strings that are matched by the scanner (e.g., "identifier", "integer constant"). A nonterminal denotes a set of strings with similar syntactic structure (e.g., "declaration", "while loop"). In the rules below, tokens are written in ALL CAPS (like TEXT); nonterminals are written in mixed or lower case (like Html and item). The symbol ε denotes the empty sequence.

A syntax rule consists of a left hand side and a right hand side, separated by a colon ":". The left hand side is a nonterminal whose structure is defined by the rule. A right hand side consists of a set of alternatives, separated by "|". Each alternative is a sequence (possibly empty) of tokens and nonterminals.

Html     : item Html
         | ε

item     : TABLE_START Html TABLE_END
         | BF_START Html BF_END
         | IT_START Html IT_END
         | List
         | Other

List     : UL_START ItemList UL_END
         | OL_START ItemList OL_END

ItemList : OneItem ItemList
         | ε

OneItem  : LI_START Html LI_END

Other    : TAG
         | TEXT

Thus, the rule for the nonterminal Html above consists of two alternatives. The first says that one possible structure for Html is something with the structure of item (which is then defined by its own rules), followed by something else which again has the structure of Html; the second says that Html can simply be the empty sequence. (For those of you who have unwound the recursion in your head, this amounts to saying that Html consists of zero or more items.)
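
Translated into yacc, the rules above map almost directly (a sketch; the %token names match the lexical rules, and /* empty */ marks an ε alternative):

%token TABLE_START TABLE_END BF_START BF_END IT_START IT_END
%token UL_START UL_END OL_START OL_END LI_START LI_END TAG TEXT
%%
Html     : item Html
         | /* empty */
         ;
item     : TABLE_START Html TABLE_END
         | BF_START Html BF_END
         | IT_START Html IT_END
         | List
         | Other
         ;
List     : UL_START ItemList UL_END
         | OL_START ItemList OL_END
         ;
ItemList : OneItem ItemList
         | /* empty */
         ;
OneItem  : LI_START Html LI_END
         ;
Other    : TAG
         | TEXT
         ;
%%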

Extension:

You need to extend your lexical and syntax rules to include the tags and rules that are part of a table in HTML; specifically, you need to define rules for <CAPTION>, <TR>, and <TD>. (Any tutorial on HTML tables will cover these tags.) Please implement the extension separately, i.e., as a separate lex and yacc file (so that it is easier to grade).
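
One possible shape for the new productions (a sketch only; the token names CAPTION_START, CAPTION_END, TR_START, TR_END, TD_START, and TD_END are hypothetical names you would define in your extended lexer, and the exact structure is part of your design):

Table   : TABLE_START Caption Rows TABLE_END
Caption : CAPTION_START Html CAPTION_END
        | ε
Rows    : Row Rows
        | ε
Row     : TR_START Cells TR_END
Cells   : Cell Cells
        | ε
Cell    : TD_START Html TD_END

With rules like these, the existing alternative "item : TABLE_START Html TABLE_END" would be replaced by "item : Table".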

Syntax Errors:

Your program will be expected to deal with errors in a ``reasonable'' way. Error messages should be printed to stderr. They should be specific and should contain enough information (at least a line number) to allow the user to locate the problem.
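
flex can maintain the current line number for you (via %option yylineno), and yacc reports syntax errors through the yyerror function; a minimal sketch:

/* in the .l file:  %option yylineno */

/* in the .y file: */
%{
#include <stdio.h>
extern int yylineno;           /* maintained by flex with %option yylineno */

void yyerror(const char *msg)  /* called by the generated parser on a syntax error */
{
    fprintf(stderr, "line %d: %s\n", yylineno, msg);
}
%}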

Running Your Program:


Your executable program will be called myhtml2txt (and myXhtml2txt for the extension). It will read input from stdin and write its output to stdout. Thus, to translate an HTML file foo.html to a text file bar.txt, invoke your program as

myhtml2txt < foo.html > bar.txt

Other Requirements

It must run on nike. You may develop it in your own environment, but as a last step make sure it runs on nike.

Submitting:

You need to name the directory of your source code "project1/". You must include a README.txt file describing how to run and process your program (the command line arguments used to generate your lexer and parser, and to compile and run your program).

Submission Process:

  1. Create a directory project1/
  2. You need to use a Makefile that contains the following targets (a sketch of such a Makefile appears after the submission steps below):
    • clean
      • Executing the command make clean should delete the *.o files, as well as the executables myhtml2txt and myXhtml2txt, from the current directory.
    • myhtml2txt
      • Executing the command make myhtml2txt should create, in the current directory, an executable file myhtml2txt that implements your HTML-to-text translator from scratch, by invoking the appropriate tools (lex/flex and yacc/bison) on the input specifications. This target should create the basic lexer/parser, the one that does not include the table tags.
    • myXhtml2txt
      • Executing the command make myXhtml2txt should create, in the current directory, an executable file myXhtml2txt that implements your HTML-to-text translator from scratch, by invoking the appropriate tools (lex/flex and yacc/bison) on the input specifications. This target should create the extended lexer/parser, the one that implements the rules/tags of a table.
  3. Include example html files that you tested with your program.
  4. Put all the materials needed (all lex/yacc files) in the above directory. You also need to include a README.txt (or README.html) file that specifies the lexical and syntax rules of your html table.
  5. Submit via the 'submit' command (while on nike.cs.uga.edu)

{nike:maria} submit project1 cs4500
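
A Makefile along these lines would provide the required targets (a sketch, assuming the file names listed under "What you need to submit"; use -lfl instead of -ll if your system provides flex, and remember that recipe lines must be indented with a TAB character):

myhtml2txt: myhtml2txt.l myhtml2txt.y
	yacc -d myhtml2txt.y                     # generates y.tab.c and y.tab.h
	lex myhtml2txt.l                         # generates lex.yy.c
	cc -o myhtml2txt y.tab.c lex.yy.c -ll    # -lfl with flex

myXhtml2txt: myXhtml2txt.l myXhtml2txt.y
	yacc -d myXhtml2txt.y
	lex myXhtml2txt.l
	cc -o myXhtml2txt y.tab.c lex.yy.c -ll

clean:
	rm -f *.o y.tab.c y.tab.h lex.yy.c myhtml2txt myXhtml2txt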

What you need to submit:

project1/

Makefile
myhtml2txt.l
myhtml2txt.y

myXhtml2txt.l
myXhtml2txt.y

...            (extra files, if needed; must be listed in README.txt)

README.txt     (how you lex/yacc/compile/run the program)


When grading we will check for these: